Comp422 2011 Lecture19 Sorting
John Mellor-Crummey
Department of Computer Science, Rice University
Parallel Sorting
COMP 422, Lecture 19, 5 April 2011
Topics for Today
• Introduction
• Issues in parallel sorting
• Sorting networks and Batcher’s bitonic sort
• Bubble sort and odd-even transposition sort
• Parallel quicksort
Why Study Parallel Sorting?
• One of the most common operations performed
• Close relation to task of routing on parallel computers
  – e.g. HPC Challenge RandomAccess benchmark
Sorting Algorithm Attributes
• Internal vs. external
  – internal: data fits in memory
  – external: uses tape or disk
• Comparison-based or not
  – comparison sort
    – basic operation: compare elements and exchange as necessary
    – Θ(n log n) comparisons to sort n numbers
  – non-comparison-based sort
    – e.g. radix sort based on the binary representation of data
    – Θ(n) operations to sort n numbers
• Parallel vs. sequential
Parallel Sorting Is Intrinsically Interesting
Different algorithms for different architecture variants
• Abstract parallel architecture
  – PRAM
• Network topology
  – hypercube
  – mesh
• Communication mechanism
  – shared address space
  – message passing
Today’s focus: parallel comparison-based sorting
Parallel Sorting Basics
• Where are the input and output lists stored?
  – we assume that both input and output lists are distributed
• What is a parallel sorted sequence?
  – sequence partitioned among the processors
  – each processor's sub-sequence is sorted
  – all in Pj's sub-sequence < all in Pk's sub-sequence if j < k
    – the best process numbering can depend on network topology
Element-wise Parallel Compare-Exchange
When partitioning is one element per process
1. Processes Pj and Pk send their elements to each other [communication step]
   Each process now has both elements
2. Process Pj keeps min(aj, ak), and Pk keeps max(aj, ak) [comparison step]
[Figure: Pj holds aj and Pk holds ak; after the communication step both hold aj and ak; after the comparison step Pj holds min(aj, ak) and Pk holds max(aj, ak)]
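The two steps above can be sketched in Python. This is only an illustration that simulates both processes inside one function; the name `compare_exchange` is hypothetical and no real message passing takes place.

```python
def compare_exchange(a_j, a_k):
    """Simulate one compare-exchange between processes P_j and P_k (j < k).

    Returns the pair (value kept by P_j, value kept by P_k).
    """
    # Communication step: after the exchange each process holds both elements.
    held_by_j = (a_j, a_k)
    held_by_k = (a_k, a_j)
    # Comparison step: P_j keeps the minimum, P_k keeps the maximum.
    return min(held_by_j), max(held_by_k)


if __name__ == "__main__":
    print(compare_exchange(7, 3))
```

Each side does the same comparison independently, so no second message is needed to agree on who keeps which value.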
Bulk Parallel Compare-Split
1. Send block of size n/p to partner
2. Each partner now has both blocks
3. Merge received block with own block
4. Retain only the appropriate half of the merged block: process Pi retains the smaller values; process Pj retains the larger values
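A minimal Python sketch of compare-split, assuming both blocks are already sorted and have the same length n/p. The function name `compare_split` is an illustrative choice, and both partners are simulated in one call.

```python
from heapq import merge


def compare_split(block_i, block_j):
    """Simulate compare-split between P_i and P_j (i < j).

    Both blocks must already be sorted and of equal length.
    Returns (block kept by P_i, block kept by P_j).
    """
    # Steps 1-3: each partner ends up with both blocks and merges them.
    merged = list(merge(block_i, block_j))
    half = len(block_i)
    # Step 4: P_i keeps the smaller half, P_j keeps the larger half.
    return merged[:half], merged[half:]


if __name__ == "__main__":
    low, high = compare_split([1, 6, 8, 11, 13], [2, 7, 9, 10, 12])
    print(low, high)
```

Because the inputs are sorted, the merge takes time linear in the block size, and both retained halves are again sorted.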
Basic Analysis
• Assumptions
  – Pi and Pj are neighbors
  – communication channels are bi-directional
• Element-wise compare-exchange: 1 element per processor
  – time = ts + tw (ts: message startup time; tw: per-word transfer time)
• Bulk compare-split: n/p elements per processor
  – after compare-split on pair of processors Pi and Pj, i < j
    – smaller n/p elements are at processor Pi
    – larger n/p elements are at Pj
  – time = ts + tw·n/p
    – merge in O(n/p) time, as long as partial lists are sorted
Sorting Network
• Network of comparators designed for sorting
• Comparator: two inputs x and y; two outputs x' and y'
  – types
    – increasing (denoted ⊕): x' = min(x,y) and y' = max(x,y)
    – decreasing (denoted Ө): x' = max(x,y) and y' = min(x,y)
• Sorting time of a network is proportional to its depth
[Figure: schematic of a ⊕ comparator (outputs min(x,y), max(x,y)) and a Ө comparator (outputs max(x,y), min(x,y))]
Sorting Networks
• Network structure: a series of columns
• Each column consists of a vector of comparators (in parallel)
• Sorting network organization:
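One way to model this column structure in Python is as a list of columns, each a list of wire pairs. The representation, the helper `run_network`, and the 4-input example `NETWORK_4` are assumptions made for illustration; they are not taken from the slides.

```python
def run_network(columns, values):
    """Apply a comparator network to the input values.

    columns: list of columns; each column is a list of (i, j) wire pairs
             with i < j, acting as increasing comparators.
    """
    out = list(values)
    for column in columns:
        # Comparators within one column touch disjoint wires,
        # so in hardware they would all fire in parallel.
        for i, j in column:
            if out[i] > out[j]:
                out[i], out[j] = out[j], out[i]
    return out


# A well-known 4-input sorting network: 3 columns, 5 comparators.
NETWORK_4 = [[(0, 1), (2, 3)], [(0, 2), (1, 3)], [(1, 2)]]

if __name__ == "__main__":
    print(run_network(NETWORK_4, [3, 1, 4, 2]))
```

The number of columns (3 here) is the depth of the network, which bounds the sorting time when each column runs in one step.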
Example: Bitonic Sorting Network
• Bitonic sequence
  – two parts: increasing and decreasing
    – 〈1,2,4,7,6,0〉: first increases and then decreases (or vice versa)
  – cyclic rotation of a bitonic sequence is also considered bitonic
    – 〈8,9,2,1,0,4〉: cyclic rotation of 〈0,4,8,9,2,1〉
• Bitonic sorting network
  – sorts n elements in Θ(log² n) time
  – network kernel: rearranges a bitonic sequence into a sorted one
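The definition can be checked mechanically. The sketch below is one possible test, not something from the slides: it walks the sequence cyclically and counts how often the direction of change flips, which covers rotations as well. The name `is_bitonic` is hypothetical.

```python
def is_bitonic(seq):
    """Return True if seq is bitonic, allowing cyclic rotations."""
    n = len(seq)
    # Signs of the non-zero steps, going around the sequence cyclically.
    signs = []
    for i in range(n):
        step = seq[(i + 1) % n] - seq[i]
        if step != 0:
            signs.append(1 if step > 0 else -1)
    # Count direction changes around the cycle (signs[-1] wraps around).
    changes = sum(1 for i in range(len(signs)) if signs[i] != signs[i - 1])
    # One rise and one fall around the cycle means at most two changes.
    return changes <= 2


if __name__ == "__main__":
    print(is_bitonic([1, 2, 4, 7, 6, 0]), is_bitonic([8, 9, 2, 1, 0, 4]))
```

A fully sorted sequence also passes, since it is a bitonic sequence whose decreasing part is empty.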
Bitonic Split
• Let s = 〈a_0, a_1, …, a_{n-1}〉 be a bitonic sequence such that
  – a_0 ≤ a_1 ≤ ··· ≤ a_{n/2-1}, and
  – a_{n/2} ≥ a_{n/2+1} ≥ ··· ≥ a_{n-1}
• Consider the following subsequences of s
  s1 = 〈min(a_0, a_{n/2}), min(a_1, a_{n/2+1}), …, min(a_{n/2-1}, a_{n-1})〉
  s2 = 〈max(a_0, a_{n/2}), max(a_1, a_{n/2+1}), …, max(a_{n/2-1}, a_{n-1})〉
• Sequence properties
  – s1 and s2 are both bitonic
  – ∀x ∀y x ∈ s1, y ∈ s2, x < y
• Apply recursively on s1 and s2 to produce a sorted sequence
• Works for any bitonic sequence, even if its increasing and decreasing parts differ in length
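A short Python sketch of one bitonic split, following the definitions of s1 and s2 above. It assumes the input is bitonic and has even length; the function name `bitonic_split` is illustrative.

```python
def bitonic_split(s):
    """Split a bitonic sequence of even length into (s1, s2).

    s1 takes the element-wise minima of the two halves, s2 the maxima.
    """
    half = len(s) // 2
    s1 = [min(s[i], s[i + half]) for i in range(half)]
    s2 = [max(s[i], s[i + half]) for i in range(half)]
    return s1, s2


if __name__ == "__main__":
    s1, s2 = bitonic_split([2, 4, 6, 8, 7, 5, 3, 1])
    print(s1, s2)
```

In this example every element of s1 is smaller than every element of s2, and each half is itself bitonic, so the same split can be applied to each half again.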
Splitting Bitonic Sequences - I
[Figure: bitonic split of a sequence into its min half (s1) and max half (s2)]
Splitting Bitonic Sequences - II
[Figure: bitonic split of a second sequence into its min half (s1) and max half (s2)]
Bitonic Merge
Sort a bitonic sequence through a series of bitonic splits
Example: use bitonic merge to sort 16-element bitonic sequence
How: perform a series of log2 16 = 4 bitonic splits
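The repeated splitting can be written as a short recursion. This is a sketch that assumes the input is bitonic and its length is a power of two; `bitonic_merge` and its `ascending` flag are names chosen for illustration.

```python
def bitonic_merge(s, ascending=True):
    """Sort a bitonic sequence whose length is a power of two."""
    if len(s) <= 1:
        return list(s)
    half = len(s) // 2
    # One bitonic split: element-wise minima and maxima of the two halves.
    lo = [min(s[i], s[i + half]) for i in range(half)]
    hi = [max(s[i], s[i + half]) for i in range(half)]
    # Both halves are bitonic again, and every element of lo <= every element of hi.
    if ascending:
        return bitonic_merge(lo, True) + bitonic_merge(hi, True)
    return bitonic_merge(hi, False) + bitonic_merge(lo, False)


if __name__ == "__main__":
    data = [3, 5, 8, 9, 10, 12, 14, 20, 95, 90, 60, 40, 35, 23, 18, 0]
    print(bitonic_merge(data))
```

For 16 elements the recursion is four levels deep, matching the four splits named above.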
Sorting via Bitonic Merging Network
• Sorting network can implement bitonic merge algorithm
  – bitonic merging network
• Network structure
  – log₂ n columns
  – each column
    – n/2 comparators
    – performs one step of the bitonic merge
• Bitonic merging network with n inputs: ⊕BM[n]
  – yields increasing output sequence
• Replacing ⊕ comparators by Ө comparators: ӨBM[n]
  – yields decreasing output sequence
Bitonic Merging Network, ⊕ BM[16]
• Input: bitonic sequence
  – input wires are numbered 0, 1, …, n - 1 (shown in binary)
• Output: sequence in sorted order
• Each column of comparators is drawn separately
Bitonic Sort
How do we sort an unsorted sequence using a bitonic merge?
Two steps
• Build a bitonic sequence
• Sort it using a bitonic merging network
Building a Bitonic Sequence
• Build a single bitonic sequence from the given sequence
  – any sequence of length 2 is a bitonic sequence
  – build bitonic sequence of length 4
    – sort first two elements using ⊕BM[2]
    – sort next two using ӨBM[2]
• Repeatedly merge to generate larger bitonic sequences
  – ⊕BM[k] & ӨBM[k]: bitonic merging networks of size k
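The construction can be sketched in Python by alternating increasing and decreasing merges on blocks of size 2, 4, and so on up to n/2. The helper names (`bitonic_merge`, `build_bitonic`, `bitonic_sort`) are illustrative, and the input length is assumed to be a power of two.

```python
def bitonic_merge(s, ascending=True):
    """Sort a bitonic sequence (length a power of two); plays the role of BM[k]."""
    if len(s) <= 1:
        return list(s)
    half = len(s) // 2
    lo = [min(s[i], s[i + half]) for i in range(half)]
    hi = [max(s[i], s[i + half]) for i in range(half)]
    if ascending:
        return bitonic_merge(lo, True) + bitonic_merge(hi, True)
    return bitonic_merge(hi, False) + bitonic_merge(lo, False)


def build_bitonic(seq):
    """Turn an arbitrary sequence into one bitonic sequence."""
    s = list(seq)
    n = len(s)
    k = 2
    while k < n:
        merged = []
        for block, start in enumerate(range(0, n, k)):
            # Even-numbered blocks use the increasing network,
            # odd-numbered blocks use the decreasing network.
            merged += bitonic_merge(s[start:start + k], ascending=(block % 2 == 0))
        s = merged
        k *= 2
    return s


def bitonic_sort(seq):
    """Build a bitonic sequence, then sort it with one final increasing merge."""
    return bitonic_merge(build_bitonic(seq), ascending=True)


if __name__ == "__main__":
    print(build_bitonic([2, 3, -7, 6, 5, 22, -8, 12]))
    print(bitonic_sort([2, 3, -7, 6, 5, 22, -8, 12]))
```

After the loop the first half of the sequence is increasing and the second half is decreasing, which is exactly the input the final merging network expects.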
Building a Bitonic Sequence
Input: sequence of 16 unordered numbers
Output: a bitonic sequence of 16 numbers
Bitonic Sort, n = 16
• First 3 stages create bitonic sequence input to stage 4
• Last stage (⊕BM[16]) yields sorted sequence
Complexity of Bitonic Sorting Networks
• Depth of the network is Θ(log² n)
  – log₂ n merge stages
  – depth of the jth merge stage is log₂ 2^j = j
  – depth = $\sum_{j=1}^{\log_2 n} \log_2 2^j = \sum_{j=1}^{\log_2 n} j = (\log_2 n + 1)(\log_2 n)/2 = \Theta(\log^2 n)$
• Each stage of the network contains n/2 comparators
• Complexity of serial implementation = Θ(n log² n)
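As a quick arithmetic check of the depth sum, the following Python sketch counts columns and comparators for a given n. It assumes n is a power of two and that every column holds n/2 comparators; the function name is hypothetical.

```python
def bitonic_network_size(n):
    """Return (depth, comparator count) of a bitonic sorting network on n inputs."""
    stages = n.bit_length() - 1            # log2(n) when n is a power of two
    depth = sum(range(1, stages + 1))      # merge stage j contributes j columns
    comparators = depth * (n // 2)         # n/2 comparators in every column
    return depth, comparators


if __name__ == "__main__":
    print(bitonic_network_size(16))
```

For n = 16 this gives 1 + 2 + 3 + 4 = 10 columns, which agrees with (log₂ n + 1)(log₂ n)/2 = 5 · 4 / 2.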
Mapping Bitonic Sort to a Hypercube
Consider one item per processor
• How do we map wires in bitonic network onto a hypercube?
• In earlier examples
  – compare-exchange between two wires when labels differ in 1 bit
• Direct mapping of wires to processors
  – all communication is nearest neighbor
Mapping Bitonic Merge to a Hypercube
Communication during the last merge stage of bitonic sort
• Each number is mapped to a hypercube node
• Each connection represents a compare-exchange
Mapping Bitonic Sort to Hypercubes
Communication in bitonic sort on a hypercube
• Processes communicate along dims shown in each stage
• Algorithm is cost optimal w.r.t. its serial counterpart
• Not cost optimal w.r.t. the best sorting algorithm
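The hypercube mapping can be simulated in Python with one item per simulated process. In this sketch the partner in each step is the rank with one bit flipped, which is a direct hypercube neighbor. The loop structure and names are assumptions for illustration, and the number of items must be a power of two.

```python
def bitonic_sort_hypercube(values):
    """Simulate bitonic sort with one item per hypercube node."""
    a = list(values)
    n = len(a)
    dims = n.bit_length() - 1                  # hypercube dimension, log2(n)
    for stage in range(1, dims + 1):           # merge stage builds runs of 2**stage
        for d in range(stage - 1, -1, -1):     # dimension used in this step
            for rank in range(n):
                partner = rank ^ (1 << d)      # neighbor across dimension d
                if rank < partner:
                    # Bit `stage` of the rank selects increasing or decreasing order.
                    ascending = ((rank >> stage) & 1) == 0
                    if ascending:
                        out_of_order = a[rank] > a[partner]
                    else:
                        out_of_order = a[rank] < a[partner]
                    if out_of_order:
                        a[rank], a[partner] = a[partner], a[rank]
    return a


if __name__ == "__main__":
    print(bitonic_sort_hypercube([2, 3, -7, 6, 5, 22, -8, 12]))
```

Every compare-exchange pairs two ranks that differ in exactly one bit, so on a real hypercube no message would travel more than one hop.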
Batcher’s Bitonic Sort in NESL
    function merge(a) =
      if (#a == 1) then a
      else
        let halves = bottop(a);
            mins = {min(x, y) : x in halves[0]; y in halves[1]};
            maxs = {max(x, y) : x in halves[0]; y in halves[1]};
        in flatten({merge(x) : x in [mins,maxs]});

    function bitonic_sort(a) =
      if (#a == 1) then a
      else
        let b = {bitonic_sort(x) : x in bottop(a)};
        in merge(b[0]++reverse(b[1]));

    bitonic_sort([2, 3, -7, 6, 5, 22, -8, 12]);

Try it at: http://www.cs.rice.edu/~johnmc/nesl.html
Bubble Sort and Variants
Sequential bubble sort algorithm
Compares and exchanges adjacent elements in the sequence
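The slide's listing did not survive in this transcript. As a stand-in, here is a conventional sequential bubble sort in Python; it is a standard textbook formulation, not necessarily the exact code shown in the lecture.

```python
def bubble_sort(seq):
    """Sort by repeatedly compare-exchanging adjacent elements."""
    a = list(seq)
    n = len(a)
    for last in range(n - 1, 0, -1):
        # After this pass the largest remaining element has moved to index `last`.
        for j in range(last):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return a


if __name__ == "__main__":
    print(bubble_sort([3, 2, 3, 8, 5, 6, 4, 1]))
```

Each compare-exchange in the inner loop depends on the result of the previous one, which is why the algorithm offers no concurrency as written.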
Bubble Sort and Variants
• Bubble sort complexity: Θ(n²)
• Difficult to parallelize
  – algorithm has no concurrency
• A simple variant uncovers concurrency
Sequential Odd-Even Transposition Sort
Odd-Even Transposition Sort, n = 8
In each phase, n = 8 elements are compared
Odd-Even Transposition Sort
• After n phases of odd-even exchanges, sequence is sorted
• Each phase of algorithm requires Θ(n) comparisons
• Serial complexity is Θ(n²)
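A sequential Python sketch of odd-even transposition sort, using 0-based indices (so the first phase pairs indices 0-1, 2-3, and so on). The function name and indexing convention are choices made for this example.

```python
def odd_even_transposition_sort(seq):
    """Sort with n alternating phases of adjacent compare-exchanges."""
    a = list(seq)
    n = len(a)
    for phase in range(n):
        # Alternate between pairs starting at index 0 and pairs starting at index 1.
        start = phase % 2
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a


if __name__ == "__main__":
    print(odd_even_transposition_sort([3, 2, 3, 8, 5, 6, 4, 1]))
```

Unlike bubble sort, the compare-exchanges inside one phase touch disjoint pairs, so they could all run at the same time.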
Parallel Odd-Even Transposition
Consider one item per processor
• n iterations
  – in each iteration, each processor does one compare-exchange
• Parallel run time of this formulation is Θ(n)
• Cost optimal with respect to the base serial algorithm
  – but not the optimal one!
Parallel Odd-Even Transposition Sort
note: if partner id < 1 or > n, then skip compare
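The partner rule in the note can be sketched as follows, with processes numbered 1 to n and one item each. This is a single-threaded simulation; `partner_of` and `parallel_odd_even_sort` are hypothetical names, and each phase reads a snapshot of the values to mimic a simultaneous exchange.

```python
def partner_of(rank, phase, n):
    """Partner of process `rank` (1-based) in `phase` (1-based), or None to skip."""
    # In odd phases odd ranks pair with rank + 1; in even phases even ranks do.
    partner = rank + 1 if rank % 2 == phase % 2 else rank - 1
    return partner if 1 <= partner <= n else None


def parallel_odd_even_sort(values):
    """Simulate n phases with one item per process."""
    n = len(values)
    held = {rank: v for rank, v in enumerate(values, start=1)}
    for phase in range(1, n + 1):
        snapshot = dict(held)   # every process sees the pre-phase values
        for rank in snapshot:
            partner = partner_of(rank, phase, n)
            if partner is None:
                continue        # partner id < 1 or > n: skip the compare
            if rank < partner:
                held[rank] = min(snapshot[rank], snapshot[partner])
            else:
                held[rank] = max(snapshot[rank], snapshot[partner])
    return [held[rank] for rank in range(1, n + 1)]


if __name__ == "__main__":
    print(parallel_odd_even_sort([3, 2, 3, 8, 5, 6, 4, 1]))
```

In a message-passing program each process would run the loop body for its own rank only, exchanging values with its partner in every phase.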
Quicksort
• Popular sequential sorting algorithm
  – simplicity, low overhead, optimal average complexity
• Operation
  – select an entry in the sequence to be the pivot
  – divide the sequence into two parts
    – one with all elements less than the pivot
    – the other with all elements greater than the pivot
• Apply the process recursively to each sublist
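For reference, a compact in-place sequential quicksort in Python. The choice of the last element as pivot and the single-pass partition are assumptions of this sketch; the slides do not prescribe a pivot rule.

```python
def quicksort(a, lo=0, hi=None):
    """Sort list `a` in place between indices lo and hi (inclusive)."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    pivot = a[hi]               # pivot choice is an assumption of this sketch
    store = lo
    for i in range(lo, hi):
        # Move elements smaller than the pivot to the front of the range.
        if a[i] < pivot:
            a[i], a[store] = a[store], a[i]
            store += 1
    a[store], a[hi] = a[hi], a[store]   # pivot lands in its final position
    quicksort(a, lo, store - 1)
    quicksort(a, store + 1, hi)


if __name__ == "__main__":
    data = [3, 2, 1, 5, 8, 4, 3, 7]
    quicksort(data)
    print(data)
```

The partition loop is the serial bottleneck that the parallel formulations try to remove.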
Parallelizing Quicksort
• First, recursive decomposition
  – partition the list serially
  – handle each subproblem on a different processor
• Time for this algorithm is lower-bounded by Ω(n)!
• Can we parallelize the partitioning step?
  – can we use n processors to partition a list of length n around a pivot in O(1) time?
• Tricky on real machines
Practical Parallel Quicksort
Each processor initially responsible for n/p elements
• Shared memory formulation
  – select first pivot & broadcast
  – each processor partitions own data
  – globally rearrange data into smaller and larger parts (in place)
  – recurse with proportional # processors on each part
• Message passing formulation
  – partitioning
    – each processor first partitions local portion of array
    – determine which processes will be responsible for each partition (based on size of smaller than pivot and larger than pivot groups)
    – divide up the data among the processor subsets responsible for each part
  – continue recursively
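One step of the message-passing formulation can be sketched in Python as below. The function `split_processors`, the rounding rule, and the guarantee of at least one processor per side are assumptions made for illustration; the slides only say that processors are assigned in proportion to the partition sizes.

```python
def split_processors(local_blocks, pivot):
    """Partition each processor's block and divide processors between the parts.

    local_blocks: one list per processor (at least two processors, some data).
    Returns (lows, highs, p_low, p_high).
    """
    # Each processor partitions its own block around the broadcast pivot.
    lows = [[x for x in block if x <= pivot] for block in local_blocks]
    highs = [[x for x in block if x > pivot] for block in local_blocks]
    n_low = sum(len(part) for part in lows)
    n_high = sum(len(part) for part in highs)
    total = n_low + n_high
    p = len(local_blocks)
    # Processors proportional to partition size, rounded to nearest integer.
    p_low = (p * n_low + total // 2) // total
    # Assumption: keep at least one processor on each side.
    p_low = min(max(p_low, 1), p - 1)
    return lows, highs, p_low, p - p_low


if __name__ == "__main__":
    blocks = [[7, 13, 18, 2], [17, 1, 14, 20], [6, 10, 15, 9], [3, 16, 19, 4]]
    print(split_processors(blocks, pivot=9))
```

After this step the low parts would be redistributed among the first `p_low` processors and the high parts among the rest, and each group would recurse with its own pivot.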
Data Parallel Quicksort in NESL
[Code listing: 8-line NESL quicksort]
• Total work is O(n log n)
• Recursion depth is O(log n)
• Depth of each operation is constant
• Total depth is O(log n) as well
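The NESL listing is not preserved in this transcript. The Python sketch below mirrors the usual data-parallel formulation of quicksort, with list comprehensions standing in for parallel apply-to-each operations. It is an assumption about the general shape of such code, not a reconstruction of the slide.

```python
def data_parallel_quicksort(a):
    """Quicksort written as whole-sequence operations.

    Each comprehension corresponds to an operation that a data-parallel
    language could apply to all elements at once.
    """
    if len(a) < 2:
        return list(a)
    pivot = a[len(a) // 2]
    lesser = [e for e in a if e < pivot]
    equal = [e for e in a if e == pivot]
    greater = [e for e in a if e > pivot]
    # The two recursive calls are independent and could run in parallel.
    sorted_parts = [data_parallel_quicksort(v) for v in (lesser, greater)]
    return sorted_parts[0] + equal + sorted_parts[1]


if __name__ == "__main__":
    print(data_parallel_quicksort([8, 14, -8, -9, 5, -9, -3, 0, 17, 19]))
```

In Python these comprehensions run sequentially; the work and depth bounds on the slide refer to a runtime that executes each one as a parallel step.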
Other Sorting Algorithms
• Shellsort: another variant of bubble sort
  – two stage process
    – log p rounds of long distance exchanges
    – followed by rounds of odd-even transposition sort until done
  – key idea: long distance moves of first stage reduce number of rounds necessary in second stage
• Radix sort: in a series of rounds, sort elements into buckets by digit
• Bucket and sample sort
  – assumes evenly distributed items in an interval
  – buckets represent evenly-sized subintervals
• Enumeration sort
  – determine rank of each element
  – place it in the correct position
  – CRCW PRAM algorithm: n² processors, sort in Θ(1) time
    – assumes that all concurrent writes to a location deposit sum
    – n processes in column j test element j against the rest; write 1 into C[j]
    – place A[j] into A[C[j]]
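A sequential Python sketch of enumeration sort: the double loop plays the role of the n × n processor grid, and incrementing `rank[j]` stands in for the summing concurrent write. Breaking ties by index is an assumption added here so that duplicate keys receive distinct positions, and the result goes to a separate output list rather than back into A.

```python
def enumeration_sort(a):
    """Sort by computing the rank of every element, then placing it."""
    n = len(a)
    rank = [0] * n
    for j in range(n):          # column j of the processor grid
        for i in range(n):      # processor (i, j) compares a[i] with a[j]
            # Ties are broken by index (an assumption) so ranks stay unique.
            if a[i] < a[j] or (a[i] == a[j] and i < j):
                rank[j] += 1    # models a concurrent write of 1, combined by sum
    out = [None] * n
    for j in range(n):
        out[rank[j]] = a[j]     # place each element at its rank
    return out


if __name__ == "__main__":
    print(enumeration_sort([5, 1, 4, 1, 3]))
```

On the CRCW PRAM all n² comparisons happen in one step, which is where the Θ(1) time bound comes from.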
Other Sorting Algorithms (Cont)
• Histogram Sorting
  – goal: divide keys into p evenly sized pieces
  – use an iterative approach to do so
  – initiating processor broadcasts k > p-1 splitter guesses
  – each processor determines how many keys fall in each bin
  – sum histogram with global reduction
  – one processor examines guesses to see which are satisfactory
  – broadcast finalized splitters and number of keys for each processor
  – each processor sends local data to appropriate processors using all-to-all communication
  – each processor merges chunks it receives
  – Kale and Solomonik improved this (IPDPS 2010)
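The histogramming round can be sketched in Python as follows. This example counts, for each splitter guess, how many keys are less than or equal to it (a cumulative count rather than a per-bin count), and it models the global reduction as a plain column sum. The function names and the cumulative formulation are assumptions for illustration.

```python
from bisect import bisect_right


def local_histogram(sorted_keys, splitters):
    """For each splitter guess, count the local keys <= that guess."""
    return [bisect_right(sorted_keys, s) for s in splitters]


def global_histogram(per_processor_keys, splitters):
    """Sum the local histograms, as a global reduction would."""
    local = [local_histogram(sorted(keys), splitters) for keys in per_processor_keys]
    return [sum(column) for column in zip(*local)]


if __name__ == "__main__":
    keys = [[1, 5, 9], [2, 6, 10], [3, 7, 11]]
    print(global_histogram(keys, splitters=[4, 8]))
```

With 9 keys on 3 processors the ideal cumulative counts are 3 and 6, so guesses whose counts land close enough to those targets would be accepted and the rest refined in the next round.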
References
• Adapted from slides “Sorting” by Ananth Grama
• Based on Chapter 9 of “Introduction to Parallel Computing” by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Addison Wesley, 2003
• “Programming Parallel Algorithms.” Guy Blelloch. Communications of the ACM, volume 39, number 3, March 1996.
• http://www.cs.cmu.edu/~scandal/nesl/algorithms.html#sort
• Edgar Solomonik and Laxmikant V. Kale. Highly Scalable Parallel Sorting. Proceedings of IPDPS 2010.