Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of...

A Practical Quicksort Algorithm for Graphics Processors

Daniel Cederman and Philippas Tsigas

Distributed Computing and Systems, Chalmers University of Technology

Overview

• System Model
• The Algorithm
• Experiments

System Model

CPU vs. GPU

Normal processors
• Multi-core
• Large cache
• Few threads
• Speculative execution

Graphics processors
• Many-core
• Small cache
• Wide and fast memory bus
• Thousands of threads
  ◦ Hides memory latency

• Designed for general purpose computation
• Extensions to C/C++

System Model – Global Memory

[Diagram labels: Global Memory]

System Model – Shared Memory

[Diagram labels: Global Memory; Multiprocessor 0, Multiprocessor 1; Shared Memory]

System Model – Thread Blocks

[Diagram labels: Global Memory; Multiprocessor 0, Multiprocessor 1; Shared Memory; Thread Block 0, Thread Block 1, Thread Block x, Thread Block y]

Challenges

• Minimal, manual cache
• 32-word SIMD instruction
• Coalesced memory access
• No block synchronization
• Expensive synchronization primitives

The Algorithm

Quicksort

[Figure: quicksort on the example sequence 5 8 3 2, annotated "Data Parallel!"]

[Figure: quicksort on the example sequence 5 8 3 2, annotated "NOT Data Parallel!"]

Quicksort

Phase One
• Sequences divided
• Block synchronization on the CPU

[Figure labels: Thread Block 1, Thread Block 2]

Quicksort

Phase Two
• Each block is assigned its own sequence
• Run entirely on GPU

[Figure labels: Thread Block 1, Thread Block 2]

Quicksort – Phase One

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

• Assign thread blocks
• Uses an auxiliary array
• In-place too expensive

Quicksort – Synchronization

2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8

• Fetch-And-Add on LeftPointer

[Animation: Thread Block ONE and Thread Block TWO each do a Fetch-And-Add on LeftPointer, which goes from 0 to 2 to 4; the output then begins 3 2 2 3, and a later frame shows LeftPointer = 10]
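The Fetch-And-Add on LeftPointer step above can be illustrated with a small CPU-side sketch. This is not the authors' GPU code: Python has no hardware fetch-and-add, so the hypothetical `FetchAndAdd` class simulates one with a lock, two Python threads stand in for the two thread blocks, and the pivot value 4 is an assumption chosen for illustration. Each "block" reserves room in the auxiliary array with a single fetch-and-add and then writes its smaller-than-pivot elements into the reserved range.

```python
import threading


class FetchAndAdd:
    """Shared counter with a fetch-and-add operation (simulated with a lock)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_and_add(self, amount):
        # Return the old value and advance the counter in one atomic step.
        with self._lock:
            old = self._value
            self._value += amount
            return old

    @property
    def value(self):
        return self._value


def block_write_smaller(chunk, pivot, left_pointer, aux):
    """One simulated thread block: reserve space once, then write into it."""
    smaller = [x for x in chunk if x < pivot]
    start = left_pointer.fetch_and_add(len(smaller))
    for i, x in enumerate(smaller):
        aux[start + i] = x


data = [2, 2, 0, 7, 4, 9, 3, 3, 3, 7, 8, 4, 8, 7, 2, 5, 0, 7, 6, 3, 4, 2, 9, 8]
pivot = 4  # assumed for illustration
aux = [None] * len(data)  # auxiliary array
left_pointer = FetchAndAdd(0)

# Two simulated thread blocks, each responsible for half of the sequence.
chunks = [data[:12], data[12:]]
threads = [
    threading.Thread(target=block_write_smaller, args=(c, pivot, left_pointer, aux))
    for c in chunks
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Whichever block runs first, the reserved ranges never overlap, so the order of the elements in the left part can vary between runs while its contents do not.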

Two Pass Partitioning

[Figure: the input sequence 2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8 partitioned into the auxiliary array; the output begins 3 2 2 3 3 0 2 3]

Three way partition

[Figure: the same partitioned sequence with the middle positions filled in]
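One way to read the two pass partitioning and three way partition slides is sketched below, under stated assumptions: the first pass only counts how many elements of each block are smaller or larger than the pivot, the counts are turned into write offsets (a sequential loop here stands in for the fetch-and-add reservations), and the second pass writes the elements into the auxiliary array. The gap left in the middle is filled with the pivot value, which is what makes the partition three-way. The function name `two_pass_partition`, the block count, and the pivot value are hypothetical.

```python
def two_pass_partition(data, pivot, num_blocks):
    """Partition data around pivot into a new auxiliary array.

    Returns (aux, left, right) where aux[:left] < pivot,
    aux[left:right] == pivot and aux[right:] > pivot.
    """
    n = len(data)
    size = max(1, -(-n // num_blocks))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, n, size)]

    # Pass 1: each block counts its smaller and larger elements.
    counts = [
        (sum(x < pivot for x in c), sum(x > pivot for x in c)) for c in chunks
    ]

    # Reserve a disjoint output range per block (stand-in for fetch-and-add
    # on a left pointer growing upward and a right pointer growing downward).
    left, right = 0, n
    offsets = []
    for smaller, larger in counts:
        offsets.append((left, right - larger))
        left += smaller
        right -= larger

    # Pass 2: each block writes its elements into its reserved ranges.
    aux = [None] * n
    for chunk, (lo, hi) in zip(chunks, offsets):
        for x in chunk:
            if x < pivot:
                aux[lo] = x
                lo += 1
            elif x > pivot:
                aux[hi] = x
                hi += 1

    # Three-way partition: the remaining gap holds the elements equal to
    # the pivot, so it is simply filled with the pivot value.
    for i in range(left, right):
        aux[i] = pivot
    return aux, left, right


data = [2, 2, 0, 7, 4, 9, 3, 3, 3, 7, 8, 4, 8, 7, 2, 5, 0, 7, 6, 3, 4, 2, 9, 8]
aux, left, right = two_pass_partition(data, pivot=4, num_blocks=4)
```

Elements in `aux[left:right]` are already in their final position, so only the two outer parts need further sorting.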

Second Phase

• Recurse on smallest sequence
  ◦ Max depth log(n)
• Explicit stack
• Alternative sorting
  ◦ Bitonic sort
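The second-phase bullets can be sketched on the CPU as follows. This is an illustrative sketch, not the GPU implementation: the function name `quicksort_explicit_stack` and the threshold `small` are hypothetical, Python's built-in `sorted` stands in for the alternative sort (bitonic sort on the slide), and the partition uses list comprehensions rather than an in-place or auxiliary-array scheme. The point it shows is that continuing with the smaller part and pushing the larger one keeps the explicit stack shallow.

```python
def quicksort_explicit_stack(values, small=4):
    """Quicksort without recursion; returns (sorted list, max stack depth)."""
    a = list(values)
    stack = [(0, len(a))]  # half-open ranges [lo, hi) still to be sorted
    max_depth = len(stack)
    while stack:
        lo, hi = stack.pop()
        while hi - lo > small:
            # Pivot: median of first, middle and last element.
            pivot = sorted((a[lo], a[(lo + hi - 1) // 2], a[hi - 1]))[1]
            part = a[lo:hi]
            less = [x for x in part if x < pivot]
            equal = [x for x in part if x == pivot]
            greater = [x for x in part if x > pivot]
            a[lo:hi] = less + equal + greater
            left = (lo, lo + len(less))
            right = (hi - len(greater), hi)
            # Continue with the smaller part, push the larger one.
            if left[1] - left[0] <= right[1] - right[0]:
                stack.append(right)
                lo, hi = left
            else:
                stack.append(left)
                lo, hi = right
            max_depth = max(max_depth, len(stack))
        # Short sequences: fall back to another sort (stand-in for bitonic).
        a[lo:hi] = sorted(a[lo:hi])
    return a, max_depth


data = [2, 2, 0, 7, 4, 9, 3, 3, 3, 7, 8, 4, 8, 7, 2, 5, 0, 7, 6, 3, 4, 2, 9, 8]
result, depth = quicksort_explicit_stack(data)
```

Because the part that is processed next is always at most half the size of the current one, the number of pending entries on the stack grows at most logarithmically in the sequence length.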

The Algorithm

Set of Algorithms

• GPU-Quicksort: Cederman and Tsigas, ESA08
• GPUSort: Govindaraju et al., SIGMOD05
• Radix-Merge: Harris et al., GPU Gems 3 ’07
• Global radix: Sengupta et al., GH07
• Hybrid: Sintorn and Assarsson, GPGPU07
• Introsort: David Musser, Software: Practice and Experience

Distributions

• Uniform
• Gaussian
• Bucket
• Sorted
• Zero
• Stanford Models

Pivot selection

• First Phase
  ◦ Average of maximum and minimum element
• Second Phase
  ◦ Median of first, middle and last element
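The two pivot rules are simple enough to write down directly. The sketch below assumes integer keys and rounds the first-phase average down; the rounding and the function names `first_phase_pivot` and `second_phase_pivot` are assumptions for illustration and are not taken from the slides.

```python
def first_phase_pivot(seq):
    """Average of the maximum and minimum element (rounded down)."""
    return (min(seq) + max(seq)) // 2


def second_phase_pivot(seq):
    """Median of the first, middle and last element."""
    first, middle, last = seq[0], seq[len(seq) // 2], seq[-1]
    return sorted((first, middle, last))[1]


data = [2, 2, 0, 7, 4, 9, 3, 3, 3, 7, 8, 4, 8, 7, 2, 5, 0, 7, 6, 3, 4, 2, 9, 8]
p1 = first_phase_pivot(data)   # (0 + 9) // 2
p2 = second_phase_pivot(data)  # median of 2, 8 and 8
```

Note that the first-phase pivot is computed from the value range and does not have to occur in the sequence, whereas the second-phase pivot is always one of the elements.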

Hardware

• 8800GTX
  ◦ 16 multiprocessors (128 cores)
  ◦ 86.4 GB/s bandwidth
• 8600GTS
  ◦ 4 multiprocessors (32 cores)
  ◦ 32 GB/s bandwidth
• CPU-Reference
  ◦ AMD Dual-Core Opteron 265 / 1.8 GHz

8800GTX – Uniform Distribution

[Chart: 1M, 2M, 4M, 8M and 16M elements; series: GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL]

8800GTX – Uniform – 16M

[Chart: GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL]

• CPU-Reference ~4s
• GPUSort and Radix-Merge
• GPU-Quicksort ~380ms
• Global Radix ~510ms

8800GTX – Sorted Distribution

[Chart: 1M, 2M, 4M and 8M elements]

8800GTX – Sorted – 8M

• Reference is faster than GPUSort and Radix-Merge
• GPU-Quicksort takes less than

8800GTX – Zero Distribution

[Chart: 1M, 2M, 4M, 8M and 16M elements]

8800GTX – Zero – 16M

• Bitonic is unaffected by distribution
• Three-way partition

8600GTS – Uniform Distribution

[Chart: 1M, 2M, 4M and 8M elements; series: GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL, Hybrid]

8600GTS – Uniform – 8M

[Chart: GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL, Hybrid]

• Reference equal to Radix-Merge
• Quicksort and hybrid ~4 times faster

8600GTS – Sorted Distribution

[Chart: 1M, 2M, 4M and 8M elements; series: GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL, Hybrid, Hybrid (Random)]

8600GTS – Sorted – 8M

[Chart: GPU-Quicksort, Global Radix, GPUSort, Radix-Merge, STL, Hybrid, Hybrid (Random)]

• Hybrid needs randomization
• Reference better

Visibility Ordering – 8800GTX

[Chart: Manuscript (2.2M), Dragon (3.6M), Statuette (5M); vertical axis from 0 to 900; series: GPU-Quicksort, Global Radix, GPUSort, Radix/Merge, STL]

• Similar to uniform distribution
• Sequence needs to be a power of two in length
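The power-of-two remark is typical of sorting networks such as bitonic sort, which the slides mention elsewhere; which of the compared algorithms the callout refers to is not stated in the extracted text. As a minimal sketch of why the restriction arises, the textbook recursive bitonic sort below halves the sequence at every level and therefore rejects other lengths. It is a plain CPU version for illustration, not any of the benchmarked implementations.

```python
def bitonic_merge(a, ascending):
    """Sort a bitonic sequence whose length is a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    a = list(a)
    half = n // 2
    # Compare-exchange elements that are half a sequence apart.
    for i in range(half):
        if (a[i] > a[i + half]) == ascending:
            a[i], a[i + half] = a[i + half], a[i]
    return bitonic_merge(a[:half], ascending) + bitonic_merge(a[half:], ascending)


def bitonic_sort(a, ascending=True):
    """Bitonic sort; the length must be a power of two (or zero)."""
    n = len(a)
    if n <= 1:
        return list(a)
    if n & (n - 1):
        raise ValueError("bitonic sort needs a power-of-two length")
    half = n // 2
    # Sort the halves in opposite directions to form a bitonic sequence.
    first = bitonic_sort(a[:half], True)
    second = bitonic_sort(a[half:], False)
    return bitonic_merge(first + second, ascending)


example = bitonic_sort([5, 8, 3, 2])
```

Sequences of other lengths are commonly handled by padding up to the next power of two, which adds work that a quicksort-style partition does not need.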

Conclusions

• Minimal, manual cache
  ◦ Used only for prefix sum and bitonic
• 32-word SIMD instruction
  ◦ Main part executes same instructions
• Coalesced memory access
  ◦ All reads coalesced
• No block synchronization
  ◦ Only required in first phase
• Expensive synchronization primitives
  ◦ Two passes amortize cost

Conclusions

• Quicksort is a viable sorting method for graphics processors and can be implemented in a data parallel way
• It is competitive

Thank you!

www.cs.chalmers.se/~cederman
