Post on 01-Apr-2015
A Practical Quicksort Algorithm for Graphics Processors
Daniel Cederman and Philippas Tsigas
Distributed Computing and Systems, Chalmers University of Technology
System Model The Algorithm Experiments
Overview
System Model
CPU vs. GPU
• Normal processors
◦ Multi-core
◦ Large cache
◦ Few threads
◦ Speculative execution
• Graphics processors
◦ Many-core
◦ Small cache
◦ Wide and fast memory bus
◦ Thousands of threads hide memory latency
• CUDA
◦ Designed for general purpose computation
◦ Extensions to C/C++
System Model – Global Memory, Shared Memory and Thread Blocks
• Global Memory
• Multiprocessor 0 and Multiprocessor 1, each with its own Shared Memory
• Thread Block 0, Thread Block 1, Thread Block x, Thread Block y
Challenges
• Minimal, manual cache
• 32-word SIMD instruction
• Coalesced memory access
• No block synchronization
• Expensive synchronization primitives
The Algorithm
Quicksort
[Figure: 5 8 3 2 is split into 3 2 and 5 8, which are sorted into 2 3 and 5 8]
• Data parallel!
• NOT data parallel!
• Phase One
◦ Sequences divided
◦ Block synchronization on the CPU
• Phase Two
◦ Each block is assigned its own sequence
◦ Run entirely on GPU
[Figure labels: Thread Block 1, Thread Block 2]
Quicksort – Phase One
2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
• Assign Thread Blocks
• Uses an auxiliary array
• In-place partitioning is too expensive
Quicksort – Synchronization
2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
• Thread Block ONE and Thread Block TWO use Fetch-And-Add on LeftPointer
• LeftPointer = 0
• LeftPointer = 2, auxiliary array starts with 3 2
• LeftPointer = 4, auxiliary array starts with 3 2 2 3
• LeftPointer = 10, the ten elements smaller than the pivot are on the left and the eleven larger elements are on the right
• Three way partition: the gap of three positions in between is filled with the pivot value 4
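The synchronization shown on these slides can be sketched on the CPU as follows. This is a minimal illustration, not the authors' CUDA code: Python threads stand in for thread blocks, a lock-protected counter stands in for the hardware fetch-and-add, and the names `AtomicCounter`, `partition_chunk` and `phase_one_step` are hypothetical. The use of a second counter for the right-hand side is also an assumption.

```python
import threading


class AtomicCounter:
    """Stand-in for an atomic fetch-and-add (emulated with a lock)."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_and_add(self, amount):
        # Return the old value and advance the counter in one atomic step.
        with self._lock:
            old = self._value
            self._value += amount
            return old


def partition_chunk(chunk, pivot, aux, left, right):
    """Work of one (emulated) thread block on its part of the sequence."""
    smaller = [x for x in chunk if x < pivot]
    larger = [x for x in chunk if x > pivot]
    # One fetch-and-add per side reserves a private range in the auxiliary array.
    lo = left.fetch_and_add(len(smaller))
    for offset, x in enumerate(smaller):
        aux[lo + offset] = x
    # The right counter holds how many slots are taken, counted from the end.
    hi = len(aux) - right.fetch_and_add(len(larger)) - len(larger)
    for offset, x in enumerate(larger):
        aux[hi + offset] = x


def phase_one_step(data, pivot, num_blocks=2):
    aux = [None] * len(data)
    left, right = AtomicCounter(), AtomicCounter()
    size = max(1, -(-len(data) // num_blocks))  # ceiling division
    blocks = [
        threading.Thread(
            target=partition_chunk,
            args=(data[start:start + size], pivot, aux, left, right),
        )
        for start in range(0, len(data), size)
    ]
    for block in blocks:
        block.start()
    for block in blocks:
        block.join()
    n_left = left.fetch_and_add(0)
    right_start = len(aux) - right.fetch_and_add(0)
    # Three-way partition: the remaining gap is filled with the pivot value.
    for i in range(n_left, right_start):
        aux[i] = pivot
    return aux, n_left, right_start


data = [2, 2, 0, 7, 4, 9, 3, 3, 3, 7, 8, 4, 8, 7, 2, 5, 0, 7, 6, 3, 4, 2, 9, 8]
aux, left_pointer, right_start = phase_one_step(data, pivot=4)
```

With the sequence from the slides and pivot 4, the left counter ends at 10, matching the final LeftPointer value shown. The order of elements inside each side depends on which block reserves its range first.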
Two Pass Partitioning
2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
< < < > = > < < < > > = > > < > < > > < = < > >
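One way to read the two passes is "count first, then move". The sketch below assumes that reading and emulates it sequentially: in the first pass every thread only counts its elements smaller and larger than the pivot, a prefix sum turns the counts into write offsets, and in the second pass every thread writes into its own range. The function name `two_pass_partition`, the strided split of the work and the thread count are illustrative assumptions.

```python
from itertools import accumulate


def exclusive_prefix_sum(counts):
    """[c0, c1, c2] -> [0, c0, c0 + c1]"""
    return [0] + list(accumulate(counts))[:-1]


def two_pass_partition(data, pivot, num_threads=4):
    n = len(data)
    # Thread t is given elements t, t + num_threads, ... (illustrative split).
    shares = [data[t::num_threads] for t in range(num_threads)]

    # Pass 1: each thread only counts, nothing is written yet.
    less = [sum(1 for x in share if x < pivot) for share in shares]
    greater = [sum(1 for x in share if x > pivot) for share in shares]

    # Prefix sums give every thread a private, non-overlapping output range.
    less_start = exclusive_prefix_sum(less)
    right_start = n - sum(greater)
    greater_start = [right_start + o for o in exclusive_prefix_sum(greater)]

    # Pass 2: each thread moves its elements; the gap keeps the pivot value.
    out = [pivot] * n
    for t, share in enumerate(shares):
        lo, hi = less_start[t], greater_start[t]
        for x in share:
            if x < pivot:
                out[lo] = x
                lo += 1
            elif x > pivot:
                out[hi] = x
                hi += 1
    return out, sum(less), right_start


data = [2, 2, 0, 7, 4, 9, 3, 3, 3, 7, 8, 4, 8, 7, 2, 5, 0, 7, 6, 3, 4, 2, 9, 8]
out, n_less, right_start = two_pass_partition(data, pivot=4)
```

Because the offsets are known before anything is written, the threads never need to synchronize while moving elements.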
Second Phase
• Recurse on smallest sequence
◦ Max depth log(n)
• Explicit stack
• Alternative sorting
◦ Bitonic sort
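The "recurse on smallest sequence" rule with an explicit stack can be sketched as below. This is a sequential illustration under assumptions, not the GPU implementation: it uses a standard Hoare-style partition, and the name `quicksort_explicit_stack` is hypothetical. It records the largest stack size reached so the log(n) bound can be checked.

```python
def quicksort_explicit_stack(seq):
    a = list(seq)
    stack = [(0, len(a) - 1)]  # explicit stack of (lo, hi) ranges
    max_depth = len(stack)
    while stack:
        lo, hi = stack.pop()
        while lo < hi:
            # Pivot: median of first, middle and last element.
            pivot = sorted((a[lo], a[(lo + hi) // 2], a[hi]))[1]
            i, j = lo, hi
            while i <= j:
                while a[i] < pivot:
                    i += 1
                while a[j] > pivot:
                    j -= 1
                if i <= j:
                    a[i], a[j] = a[j], a[i]
                    i += 1
                    j -= 1
            # The range is now split into [lo..j] and [i..hi].
            if j - lo < hi - i:
                stack.append((i, hi))  # postpone the larger part
                hi = j                 # continue with the smaller part
            else:
                stack.append((lo, j))
                lo = i
            max_depth = max(max_depth, len(stack))
    return a, max_depth


data = [2, 2, 0, 7, 4, 9, 3, 3, 3, 7, 8, 4, 8, 7, 2, 5, 0, 7, 6, 3, 4, 2, 9, 8]
result, depth = quicksort_explicit_stack(data)
```

Continuing with the smaller part means the current range at least halves with every push, which is what keeps the stack depth logarithmic in the sequence length.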
Experiments
Set of Algorithms
• GPU-Quicksort: Cederman and Tsigas, ESA08
• GPUSort: Govindaraju et al., SIGMOD05
• Radix-Merge: Harris et al., GPU Gems 3 ’07
• Global radix: Sengupta et al., GH07
• Hybrid: Sintorn and Assarsson, GPGPU07
• Introsort: David Musser, Software: Practice and Experience
Distributions
• Uniform
• Gaussian
• Bucket
• Sorted
• Zero
• Stanford Models
[Figure: point P and points x1, x2, x3, x4]
Pivot selection
• First Phase
◦ Average of maximum and minimum element
• Second Phase
◦ Median of first, middle and last element
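The two pivot rules can be written down in a few lines. This sketch assumes integer keys (so the average is rounded down) and picks the lower middle element for sequences of even length. Both choices, and the function names, are assumptions made for illustration.

```python
def first_phase_pivot(seq):
    """Average of the maximum and minimum element (rounded down)."""
    return (min(seq) + max(seq)) // 2


def second_phase_pivot(seq):
    """Median of the first, middle and last element."""
    first, middle, last = seq[0], seq[(len(seq) - 1) // 2], seq[-1]
    return sorted((first, middle, last))[1]


data = [2, 2, 0, 7, 4, 9, 3, 3, 3, 7, 8, 4, 8, 7, 2, 5, 0, 7, 6, 3, 4, 2, 9, 8]
p1 = first_phase_pivot(data)           # minimum 0, maximum 9
p2 = second_phase_pivot([5, 8, 3, 2])  # median of 5, 8 and 2
```

For the 24-element example sequence the first rule gives 4, which is consistent with LeftPointer = 10 on the synchronization slides: exactly ten of its elements are smaller than 4.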
Hardware
• 8800GTX
◦ 16 multiprocessors (128 cores)
◦ 86.4 GB/s bandwidth
• 8600GTS
◦ 4 multiprocessors (32 cores)
◦ 32 GB/s bandwidth
• CPU-Reference
◦ AMD Dual-Core Opteron 265 / 1.8 GHz
8800GTX – Uniform Distribution
[Figure: time (ms) against size, 1M to 16M elements, for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge and STL]
8800GTX – Uniform – 16M
[Figure: time (ms) per algorithm at 16M elements]
• CPU-Reference ~4s
• GPUSort and Radix-Merge ~1.5s
• GPU-Quicksort ~380ms
• Global Radix ~510ms
8800GTX – Sorted Distribution
[Figure: time (ms) against size, 1M to 8M elements, for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge and STL]
8800GTX – Sorted – 8M
[Figure: time (ms) per algorithm at 8M elements]
• Reference is faster than GPUSort and Radix-Merge
• GPU-Quicksort takes less than 200ms
8800GTX – Zero Distribution
[Figure: time (ms) against size, 1M to 16M elements, for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge and STL]
8800GTX – Zero – 16M
[Figure: time (ms) per algorithm at 16M elements]
• Bitonic is unaffected by distribution
• Three-way partition
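The remark that bitonic sorting is unaffected by the distribution follows from the fact that a bitonic network performs the same compare-exchange steps whatever the data is. The sketch below is a plain sequential bitonic sort, written only to illustrate that property. It is not the GPUSort implementation, and the name `bitonic_sort` is hypothetical. It also shows why the sequence length has to be a power of two.

```python
def bitonic_sort(values):
    a = list(values)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    comparisons = 0
    k = 2  # size of the bitonic sequences being merged
    while k <= n:
        j = k // 2  # distance between compared elements
        while j >= 1:
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    comparisons += 1
                    ascending = (i & k) == 0
                    # Compare-exchange: the pair is fixed by i, j and k alone.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a, comparisons


random_like, c1 = bitonic_sort([5, 8, 3, 2, 7, 1, 6, 4])
all_zero, c2 = bitonic_sort([0] * 8)
presorted, c3 = bitonic_sort(list(range(8)))
```

All three calls perform the same number of comparisons, because the loop structure never looks at the values when deciding which pairs to compare.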
8600GTS – Uniform Distribution
[Figure: time (ms) against size, 1M to 8M elements, for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL and Hybrid]
8600GTS – Uniform – 8M
[Figure: time (ms) per algorithm at 8M elements]
• Reference equal to Radix-Merge
• Quicksort and hybrid ~4 times faster
8600GTS – Sorted Distribution
[Figure: time (ms) against size, 1M to 8M elements, for GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL, Hybrid and Hybrid (Random)]
8600GTS – Sorted – 8M
[Figure: time (ms) per algorithm at 8M elements]
• Hybrid needs randomization
• Reference better
Visibility Ordering – 8800GTX
[Figure: time (ms) for the Manuscript (2.2M), Dragon (3.6M) and Statuette (5M) models, for GPU-Quicksort, Global Radix, GPUSort, Radix/Merge and STL]
• Similar to uniform distribution
• Sequence needs to be a power of two in length
Conclusions
• Minimal, manual cache
◦ Used only for prefix sum and bitonic
• 32-word SIMD instruction
◦ Main part executes same instructions
• Coalesced memory access
◦ All reads coalesced
• No block synchronization
◦ Only required in first phase
• Expensive synchronization primitives
◦ Two passes amortize the cost
• Quicksort is a viable sorting method for graphics processors and can be implemented in a data parallel way
• It is competitive
Thank you!
www.cs.chalmers.se/~cederman