Parallel Algorithms - School of Computing · 2018-02-22 · Bucket sort Assume input is uniformly...
Parallel Algorithms, Part 2
Last time …
Introduction to Parallel Algorithms
Complexity analysis
Work/Depth model
Prefix Sum, Parallel Select
Questions?
Parallel Select
Select numbers < pivot
𝐴 ← [1 2 3 0 4 0 2 3 0 1 3 4]
pivot ← 2
t = [1 0 0 1 0 1 0 0 1 1 0 0] (flags: 𝑎𝑖 < pivot)
s = [1 1 1 2 2 3 3 3 4 5 5 5] (inclusive prefix sum of t)
Parallel Select: the 𝑎𝑖 < pv
[l, m] ← select_lower(a, n, pv)
// t = t[0,…,n-1]
parfor (i = 0; i < n; ++i) t[i] ← a[i] < pv;
s ← scan(t); m ← s[n-1];
parfor (i = 0; i < n; ++i) if (t[i]) l[s[i] - 1] ← a[i];
W(n) = O(n), D(n) = O(log n)
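The three steps above can be simulated sequentially; a minimal Python sketch, with `accumulate` standing in for the parallel scan and ordinary loops standing in for the parfors:

```python
# Sketch of select_lower: flag, scan, scatter. The two loops marked
# "parfor" have independent iterations and could run in parallel.
from itertools import accumulate

def select_lower(a, pv):
    t = [1 if x < pv else 0 for x in a]  # parfor: t[i] = a[i] < pv
    s = list(accumulate(t))              # scan: inclusive prefix sum
    m = s[-1] if s else 0                # m = number of selected elements
    l = [None] * m
    for i in range(len(a)):              # parfor: scatter, no write conflicts
        if t[i]:
            l[s[i] - 1] = a[i]           # s[i]-1 is a[i]'s slot in l
    return l, m
```

On the example above, `select_lower([1,2,3,0,4,0,2,3,0,1,3,4], 2)` yields `([1, 0, 0, 0, 1], 5)`.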
Today …
Intro to Parallel Algorithms
Parallel Search
Parallel Sorting
Merge sort
Sample sort
Bitonic sort
Communication costs
Parallel Search
Problem Description
Given a sorted list 𝑋 of size 𝑛 and an element 𝑦
Find the index 𝑖 such that 𝑥𝑖 ≤ 𝑦 < 𝑥𝑖+1
Sequential
Use binary search
𝑂(log𝑛) time
Work depth
parfor(i) if 𝑥𝑖 ≤ 𝑦 < 𝑥𝑖+1 return i; // assumes no duplicates
W = n, D = 1
PRAM
𝑂(log 𝑛 / log 𝑝) using 𝑝 processes
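A sequential simulation of the 𝑝-processor idea may help: each round probes 𝑝 evenly spaced positions at once, shrinking the search range by a factor of about 𝑝 + 1, which gives O(log n / log p) rounds. The function name and the lo/hi combining step are illustrative, not from the slides:

```python
# p-ary search sketch: p simultaneous probes per round. Each round's
# comparisons are independent (one per processor); combining their
# results narrows [lo, hi) to roughly 1/(p+1) of its size.
def p_ary_search(x, y, p):
    """Return i such that x[i] <= y < x[i+1]; -1 if y < x[0]."""
    lo, hi = -1, len(x)                  # invariant: x[lo] <= y < x[hi]
    while hi - lo > 1:
        probes = sorted({lo + (hi - lo) * k // (p + 1) for k in range(1, p + 1)})
        probes = [j for j in probes if lo < j < hi]
        # all p comparisons happen in one parallel step on a PRAM
        lo = max([lo] + [j for j in probes if x[j] <= y])
        hi = min([hi] + [j for j in probes if x[j] > y])
    return lo
```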
Ranking
Given ordered lists, 𝐴, 𝐵 of lengths 𝑠, 𝑡
Define: rank(𝑧 : 𝐴) ← number of elements 𝑎𝑖 such that 𝑎𝑖 ≤ 𝑧
Define: rank(𝐵 : 𝐴) ≔ (𝑟1, 𝑟2, …, 𝑟𝑡) where 𝑟𝑖 ← rank(𝑏𝑖 : 𝐴)
Ranking
𝐴 = [7 13 25 26 31 54]
𝐵 = [1 8 13 27]
rank(𝐵 : 𝐴) = [0 1 2 4]
rank(𝐴 : 𝐵) = [1 3 3 3 4 4]
Use binary search
Consider a multithreaded vs Hadoop implementation
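Since each rank(𝑏𝑖 : 𝐴) is an independent binary search, a sketch is one `bisect_right` per element (the helper name `rank` is ours); in a multithreaded setting the searches run concurrently:

```python
# rank(B : A): for each b in B, the number of elements of A that are <= b.
# Each query is an independent O(log |A|) binary search -> trivially parallel.
from bisect import bisect_right

def rank(B, A):
    return [bisect_right(A, b) for b in B]   # parallelizable across the b's
```

This reproduces the slide example: `rank([1,8,13,27], [7,13,25,26,31,54])` gives `[0, 1, 2, 4]`.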
Parallel Sort: Merge Sort
Divide & Conquer Merge Sort
Divide 𝑋 into 𝑋1 and 𝑋2
Sort 𝑋1 and 𝑋2
Merge 𝑋1 and 𝑋2
Uses a Binary Tree
Bottom-up approach
Start with the leaves
Climb to the root
Merge the branches
Requires parallel Merge
Example (bottom-up merge tree)
Input (leaves): 12 | -5 | -7 | 51 | 6 | 28 | -8 | 3
Level 1: -5, 12 | -7, 51 | 6, 28 | -8, 3
Level 2: -7, -5, 12, 51 | -8, 3, 6, 28
Root: -8, -7, -5, 3, 6, 12, 28, 51
Merge sort
b = Merge_Sort(a, n)
  if n < 100
    return seqSort(a, n);
  b1 = Merge_Sort(a[0,…,n/2-1], n/2);  // the two recursive calls
  b2 = Merge_Sort(a[n/2,…,n-1], n/2);  // are independent: run in parallel
  return Merge(b1, b2);
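A runnable Python rendering of this pseudocode (sequential; the comments mark where the two recursive calls would be spawned in parallel, and the cutoff of 100 is the slides' base case):

```python
# Merge sort as on the slide: recurse on halves, then merge.
def merge(b1, b2):
    out, i, j = [], 0, 0
    while i < len(b1) and j < len(b2):   # standard two-finger merge
        if b1[i] <= b2[j]:
            out.append(b1[i]); i += 1
        else:
            out.append(b2[j]); j += 1
    return out + b1[i:] + b2[j:]

def merge_sort(a, cutoff=100):
    if len(a) < cutoff:
        return sorted(a)                 # seqSort base case
    mid = len(a) // 2
    b1 = merge_sort(a[:mid], cutoff)     # independent: could run in parallel
    b2 = merge_sort(a[mid:], cutoff)     # (e.g. one task per subtree)
    return merge(b1, b2)
```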
Merge Sort - Complexity
W(n) = 2W(n/2) + W_merge(n); with a work-optimal merge, W(n) = O(n log n)
D(n) = D(n/2) + D_merge(n); with a D_merge = O(log n) parallel merge, D(n) = O(log² n)
Parallel Merge
Merging two lists of lengths 𝑛,𝑚
Problem description (𝑚 ≤ 𝑛)
Given 𝐴 = (𝑎1, 𝑎2, …, 𝑎𝑛) and 𝐵 = (𝑏1, 𝑏2, …, 𝑏𝑚) with
𝑎𝑖 < 𝑎𝑖+1 ∀𝑖
𝑏𝑖 < 𝑏𝑖+1 ∀𝑖
𝐴 ∩ 𝐵 = ∅
Build 𝐶 = (𝑐1, 𝑐2, …, 𝑐𝑛+𝑚) with
𝑐𝑖 ∈ 𝐴 ∪ 𝐵
𝑐𝑖 < 𝑐𝑖+1 ∀𝑖
Merging two sorted lists
Best Sequential Time: 𝑂(𝑛)
Parallel Merge:
Tradeoffs between
Depth-Optimal
Work-Optimal
Merging using Ranking
Assume elements in 𝐴 and 𝐵 are distinct
Let 𝐶 be the merged result. Given 𝑥 ∈ 𝐶,
rank(𝑥 : 𝐶) = 𝑖 ⟺ 𝑐𝑖 = 𝑥
Property: rank(𝑥 : 𝐶) = rank(𝑥 : 𝐴) + rank(𝑥 : 𝐵)
Solution to the merging problem,
Find rank 𝐴: 𝐵 and rank(𝐵: 𝐴)
Parallel searches using 𝑝 = 𝑛𝑚 processors: 𝐷 = 𝑂(1) but 𝑊 = 𝑂(𝑛²)
Concurrent binary searches: 𝐷 = 𝑂(log 𝑛) and 𝑊 = 𝑂(𝑛 log 𝑛)
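Using the rank property, every element's final position in 𝐶 can be computed independently: the 0-based slot of 𝑥 is (elements of 𝐴 below it) + (elements of 𝐵 below it). A sketch assuming distinct elements (function name ours):

```python
# Merge by ranking: each element's slot in C is its own rank plus its
# rank in the other list; all writes are independent -> one parallel step.
from bisect import bisect_right

def merge_by_rank(A, B):
    C = [None] * (len(A) + len(B))
    for i, a in enumerate(A):            # parfor over A
        C[i + bisect_right(B, a)] = a
    for j, b in enumerate(B):            # parfor over B
        C[j + bisect_right(A, b)] = b
    return C
```

With concurrent binary searches this is exactly the D = O(log n), W = O(n log n) scheme above.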
Goal: Parallelize with optimal work
Recall that an algorithm is work-optimal iff W_par = O(W_seq)
Example
Work-optimal merge - Merge1
𝐴 = (𝑎1, …, 𝑎𝑛), 𝐵 = (𝑏1, …, 𝑏𝑚), 𝑛 ≥ 𝑚
1. Partition 𝐵 into 𝑚/log 𝑚 blocks, each of size log 𝑚
2. parallel for i = 1 : 𝑚/log 𝑚
   𝑅𝑖 ← rank(𝑏𝑖·log 𝑚 : 𝐴) using sequential binary search
3. Partition 𝐴 accordingly
   Block 𝐴𝑖 : (𝑎𝑅𝑖−1+1, …, 𝑎𝑅𝑖)
4. Merge blocks 𝐴𝑖 and 𝐵𝑖 in 𝑂(log 𝑚) time using sequential merge
   But if |𝐴𝑖| ≫ |𝐵𝑖| = log 𝑚, then recurse: Merge1(𝐵𝑖, 𝐴𝑖)
Work ?
Depth ?
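A sequential sketch of Merge1's structure, assuming distinct inputs: rank every block boundary of B in A with a binary search, then merge the aligned block pairs. Here `sorted` stands in for the O(log m) sequential block merge, and the recursion for oversized A-blocks is omitted:

```python
# Merge1 sketch: rank B's block boundaries in A, merge aligned blocks.
# The boundary rankings (step 2) are independent binary searches.
from bisect import bisect_right
from math import log2

def merge1(A, B):
    if not B:
        return list(A)
    k = max(1, int(log2(len(B))))          # block size ~ log m
    cuts = list(range(k, len(B), k))       # block boundaries in B
    R = [bisect_right(A, B[c - 1]) for c in cuts]   # parallel rankings
    out, pa, pb = [], 0, 0
    for c, r in zip(cuts + [len(B)], R + [len(A)]):
        # sorted() stands in for the sequential per-block merge
        out.extend(sorted(A[pa:r] + B[pb:c]))
        pa, pb = r, c
    return out
```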
Sequential Sorting
What is the complexity ?
𝒪(?)
Sequential Sorting
Comparison based
𝒪(𝑛 log 𝑛)
Can we sort faster than 𝒪(𝑛 log 𝑛) ?
Non-comparison based
𝒪(𝑛)
Bucket sort
Assume input is uniformly distributed over an interval [𝑎, 𝑏]
Divide interval into 𝑚 equal sized intervals (buckets)
Drop numbers into appropriate buckets
Sort each bucket (say using quicksort)
𝒪(𝑛 log(𝑛/𝑚))
For 𝑚 = 𝒪(𝑛): 𝒪(𝑛) sorting
Radix sort
dense, uniform distribution
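A sketch of the bucket sort just described (sequential; the per-bucket sorts are independent and parallelize trivially). Assumes inputs lie in [lo, hi):

```python
# Bucket sort: m equal-width buckets over [lo, hi); O(n log(n/m)) total,
# O(n) when m = O(n) and the input is roughly uniform.
def bucket_sort(xs, lo, hi, m):
    buckets = [[] for _ in range(m)]
    for x in xs:                                      # drop into buckets
        i = min(int((x - lo) * m / (hi - lo)), m - 1)
        buckets[i].append(x)
    out = []
    for b in buckets:                                 # independent -> parallel
        out.extend(sorted(b))                         # e.g. quicksort each
    return out
```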
Parallel Quicksort (figure: data distributed over 𝑝1 𝑝2 𝑝3 𝑝4)
parallel median selection
parallel exchange
Sample Sort (figure: data distributed over 𝑝1 𝑝2 𝑝3 𝑝4)
𝑝1: pick splitters and broadcast
bucket data & all-to-all exchange
Sample sort
1. randomly partition input into 𝑛/𝑝 points per processor
2. sort locally
3. select 𝑝 splitters/processor (evenly spaced)
   guarantees no more than 2𝑛/𝑝 elements per bucket (proof below)
4. gather(splitters) in 𝑝0
5. sort splitters in 𝑝0 and create buckets
6. block partition using 𝑝 binary searches on the 𝑛/𝑝-element sorted sequences
7. exchange data (all-to-all)
8. sort locally again
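A single-process simulation of these steps with 𝑝 logical processors. The sampling and splitter-selection indices here are one plausible choice, not exactly the slides' scheme, and it assumes n is large relative to p:

```python
# Sample sort, simulated: local sort, sample, splitters, all-to-all, sort.
from bisect import bisect_right

def sample_sort(xs, p):
    n = len(xs)
    # 1-2. block-partition the input and sort each "processor's" chunk
    chunks = [sorted(xs[i * n // p:(i + 1) * n // p]) for i in range(p)]
    # 3-5. each chunk contributes p-1 evenly spaced samples; "p0" sorts the
    # gathered sample and picks p-1 global splitters from it
    sample = sorted(c[len(c) * (k + 1) // p] for c in chunks for k in range(p - 1))
    splitters = [sample[(k + 1) * len(sample) // p] for k in range(p - 1)]
    # 6-7. route every element to its bucket (the all-to-all exchange)
    buckets = [[] for _ in range(p)]
    for x in xs:
        buckets[bisect_right(splitters, x)].append(x)
    # 8. final local sorts; concatenation is globally sorted
    return [x for b in buckets for x in sorted(b)]
```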
Sample Sort - Complexity
Sort locally: 𝒪((𝑛/𝑝) log(𝑛/𝑝))
Select 𝑝 − 1 splitters per process: 𝒪(𝑝)
Gather splitters in 𝑝0: 𝒪(𝑝²)
Sort splitters in 𝑝0: 𝒪(𝑝² log 𝑝)
Broadcast splitters: 𝒪(𝑝 log 𝑝)
Sort again: 𝒪((𝑛/𝑝) log(𝑛/𝑝))
Sample Sort – load balance
Guarantees no more than 2𝑛/𝑝 elements per bucket
Proof:
All entries on 𝑝𝑖 must be > 𝑠𝑖−1 and ≤ 𝑠𝑖
(𝑖 − 2)𝑝 + 𝑝/2 elements of the sample are ≤ 𝑠𝑖
  lower bound: lb = ((𝑖 − 2)𝑝 + 𝑝/2) · 𝑛/𝑝²
(𝑝 − 𝑖)𝑝 − 𝑝/2 elements of the sample are > 𝑠𝑖
  upper bound: ub = ((𝑝 − 𝑖)𝑝 − 𝑝/2) · 𝑛/𝑝² + 𝑛/𝑝² − 1
Maximum number of elements on processor 𝑖:
𝑛 − ub − lb = 2𝑛/𝑝 − 𝑛/𝑝² + 1 ≤ 2𝑛/𝑝 (for 𝑛 ≥ 𝑝²) ∎