Parallel Algorithms - School of Computing · 2018-02-22 · Bucket sort Assume input is uniformly...
Parallel Algorithms, Part 2
Last time …
Introduction to Parallel Algorithms
Complexity analysis
Work/Depth model
Prefix Sum, Parallel Select
Questions?
Parallel Select
Select numbers < pivot
𝐴 ← [1 2 3 0 4 0 2 3 0 1 3 4]
pivot ← 2
t = [1 0 0 1 0 1 0 0 1 1 0 0] (flags: 𝑎𝑖 < pivot)
s = [1 1 1 2 2 3 3 3 4 5 5 5] (inclusive prefix sum of t)
Parallel Select: the 𝑎𝑖 < pv
[l, m] ← select_lower(a, n, pv)
// t = t[0,…,n-1]
parfor (i = 0; i < n; ++i) t[i] ← a[i] < pv;
s ← scan(t); m ← s[n-1];
parfor (i = 0; i < n; ++i) if (t[i]) l[s[i] - 1] ← a[i];
W(n) = O(n), D(n) = O(log n)
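The three steps above can be simulated sequentially; a minimal Python sketch, with `accumulate` standing in for the parallel scan and ordinary loops standing in for the parfors:

```python
# Sketch of select_lower: flag, scan, scatter. The two loops marked
# "parfor" have independent iterations and could run in parallel.
from itertools import accumulate

def select_lower(a, pv):
    t = [1 if x < pv else 0 for x in a]  # parfor: t[i] = a[i] < pv
    s = list(accumulate(t))              # scan: inclusive prefix sum
    m = s[-1] if s else 0                # m = number of selected elements
    l = [None] * m
    for i in range(len(a)):              # parfor: scatter, no write conflicts
        if t[i]:
            l[s[i] - 1] = a[i]           # s[i]-1 is a[i]'s slot in l
    return l, m
```

On the example above, `select_lower([1,2,3,0,4,0,2,3,0,1,3,4], 2)` yields `([1, 0, 0, 0, 1], 5)`.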
Today …
Intro to Parallel Algorithms
Parallel Search
Parallel Sorting
Merge sort
Sample sort
Bitonic sort
Communication costs
Parallel Search
Problem Description
Given a sorted list 𝑋 of size 𝑛 and an element 𝑦
Find the index 𝑖 such that 𝑥𝑖 ≤ 𝑦 < 𝑥𝑖+1
Sequential
Use binary search
𝑂(log𝑛) time
Work depth
parfor(i) if 𝑥𝑖 ≤ 𝑦 < 𝑥𝑖+1 return i; // assumes no duplicates
W = n, D = 1
PRAM
𝑂(log 𝑛 / log 𝑝) using 𝑝 processes
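A sequential simulation of the 𝑝-processor idea may help: each round probes 𝑝 evenly spaced positions at once, shrinking the search range by a factor of about 𝑝 + 1, which gives O(log n / log p) rounds. The function name and the lo/hi combining step are illustrative, not from the slides:

```python
# p-ary search sketch: p simultaneous probes per round. Each round's
# comparisons are independent (one per processor); combining their
# results narrows [lo, hi) to roughly 1/(p+1) of its size.
def p_ary_search(x, y, p):
    """Return i such that x[i] <= y < x[i+1]; -1 if y < x[0]."""
    lo, hi = -1, len(x)                  # invariant: x[lo] <= y < x[hi]
    while hi - lo > 1:
        probes = sorted({lo + (hi - lo) * k // (p + 1) for k in range(1, p + 1)})
        probes = [j for j in probes if lo < j < hi]
        # all p comparisons happen in one parallel step on a PRAM
        lo = max([lo] + [j for j in probes if x[j] <= y])
        hi = min([hi] + [j for j in probes if x[j] > y])
    return lo
```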
Ranking
Given ordered lists, 𝐴, 𝐵 of lengths 𝑠, 𝑡
Define: rank(𝑧 : 𝐴) ← number of elements 𝑎𝑖 such that 𝑎𝑖 ≤ 𝑧
Define: rank(𝐵 : 𝐴) ≔ (𝑟1, 𝑟2, …, 𝑟𝑡) where 𝑟𝑖 ← rank(𝑏𝑖 : 𝐴)
Ranking
𝐴 = [7 13 25 26 31 54]
𝐵 = [1 8 13 27]
rank(𝐵 : 𝐴) = [0 1 2 4]
rank(𝐴 : 𝐵) = [1 3 3 3 4 4]
Use binary search
Consider a multithreaded vs Hadoop implementation
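Since each rank(𝑏𝑖 : 𝐴) is an independent binary search, a sketch is one `bisect_right` per element (the helper name `rank` is ours); in a multithreaded setting the searches run concurrently:

```python
# rank(B : A): for each b in B, the number of elements of A that are <= b.
# Each query is an independent O(log |A|) binary search -> trivially parallel.
from bisect import bisect_right

def rank(B, A):
    return [bisect_right(A, b) for b in B]   # parallelizable across the b's
```

This reproduces the slide example: `rank([1,8,13,27], [7,13,25,26,31,54])` gives `[0, 1, 2, 4]`.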
Parallel Sort: Merge Sort
Divide & Conquer Merge Sort
Divide 𝑋 into 𝑋1 and 𝑋2
Sort 𝑋1 and 𝑋2
Merge 𝑋1 and 𝑋2
Uses a Binary Tree
Bottom-up approach
Start with the leaves
Climb to the root
Merge the branches
Requires parallel Merge
Example (bottom-up merge tree)
Input (leaves): 12 | -5 | -7 | 51 | 6 | 28 | -8 | 3
Level 1: -5, 12 | -7, 51 | 6, 28 | -8, 3
Level 2: -7, -5, 12, 51 | -8, 3, 6, 28
Root: -8, -7, -5, 3, 6, 12, 28, 51
Merge sort
b = Merge_Sort(a, n)
  if n < 100
    return seqSort(a, n);
  b1 = Merge_Sort(a[0,…,n/2-1], n/2);  // the two recursive calls
  b2 = Merge_Sort(a[n/2,…,n-1], n/2);  // are independent: run in parallel
  return Merge(b1, b2);
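A runnable Python rendering of this pseudocode (sequential; the comments mark where the two recursive calls would be spawned in parallel, and the cutoff of 100 is the slides' base case):

```python
# Merge sort as on the slide: recurse on halves, then merge.
def merge(b1, b2):
    out, i, j = [], 0, 0
    while i < len(b1) and j < len(b2):   # standard two-finger merge
        if b1[i] <= b2[j]:
            out.append(b1[i]); i += 1
        else:
            out.append(b2[j]); j += 1
    return out + b1[i:] + b2[j:]

def merge_sort(a, cutoff=100):
    if len(a) < cutoff:
        return sorted(a)                 # seqSort base case
    mid = len(a) // 2
    b1 = merge_sort(a[:mid], cutoff)     # independent: could run in parallel
    b2 = merge_sort(a[mid:], cutoff)     # (e.g. one task per subtree)
    return merge(b1, b2)
```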
Merge Sort - Complexity
W(n) = 2W(n/2) + W_merge(n); with a work-optimal merge, W(n) = O(n log n)
D(n) = D(n/2) + D_merge(n); with a D_merge = O(log n) parallel merge, D(n) = O(log² n)
Parallel Merge
Merging two lists of lengths 𝑛,𝑚
Problem description (𝑚 ≤ 𝑛)
Given 𝐴 = (𝑎1, 𝑎2, …, 𝑎𝑛) and 𝐵 = (𝑏1, 𝑏2, …, 𝑏𝑚) with
𝑎𝑖 < 𝑎𝑖+1 ∀𝑖
𝑏𝑖 < 𝑏𝑖+1 ∀𝑖
𝐴 ∩ 𝐵 = ∅
Build 𝐶 = (𝑐1, 𝑐2, …, 𝑐𝑛+𝑚) with
𝑐𝑖 ∈ 𝐴 ∪ 𝐵
𝑐𝑖 < 𝑐𝑖+1 ∀𝑖
Merging two sorted lists
Best Sequential Time: 𝑂(𝑛)
Parallel Merge:
Tradeoffs between
Depth-Optimal
Work-Optimal
Merging using Ranking
Assume elements in 𝐴 and 𝐵 are distinct
Let 𝐶 be the merged result. Given 𝑥 ∈ 𝐶,
rank(𝑥 : 𝐶) = 𝑖 ⟺ 𝑐𝑖 = 𝑥
Property: rank(𝑥 : 𝐶) = rank(𝑥 : 𝐴) + rank(𝑥 : 𝐵)
Solution to the merging problem,
Find rank 𝐴: 𝐵 and rank(𝐵: 𝐴)
Parallel searches using 𝑝 = 𝑛𝑚 processors: 𝐷 = 𝑂(1) but 𝑊 = 𝑂(𝑛²)
Concurrent binary searches: 𝐷 = 𝑂(log 𝑛) and 𝑊 = 𝑂(𝑛 log 𝑛)
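Using the rank property, every element's final position in 𝐶 can be computed independently: the 0-based slot of 𝑥 is (elements of 𝐴 below it) + (elements of 𝐵 below it). A sketch assuming distinct elements (function name ours):

```python
# Merge by ranking: each element's slot in C is its own rank plus its
# rank in the other list; all writes are independent -> one parallel step.
from bisect import bisect_right

def merge_by_rank(A, B):
    C = [None] * (len(A) + len(B))
    for i, a in enumerate(A):            # parfor over A
        C[i + bisect_right(B, a)] = a
    for j, b in enumerate(B):            # parfor over B
        C[j + bisect_right(A, b)] = b
    return C
```

With concurrent binary searches this is exactly the D = O(log n), W = O(n log n) scheme above.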
Goal: Parallelize with optimal work
Recall that an algorithm is work-optimal iff W_par = O(W_seq)
Example
Work-optimal merge - Merge1
𝐴 = (𝑎1, …, 𝑎𝑛), 𝐵 = (𝑏1, …, 𝑏𝑚), 𝑛 ≥ 𝑚
1. Partition 𝐵 into 𝑚/log 𝑚 blocks, each of size log 𝑚
2. parallel for i = 1 : 𝑚/log 𝑚
   𝑅𝑖 ← rank(𝑏𝑖·log 𝑚 : 𝐴) using sequential binary search
3. Partition 𝐴 accordingly
   Block 𝐴𝑖 : (𝑎𝑅𝑖−1+1, …, 𝑎𝑅𝑖)
4. Merge blocks 𝐴𝑖 and 𝐵𝑖 in 𝑂(log 𝑚) time using sequential merge
   But if |𝐴𝑖| ≫ |𝐵𝑖| = log 𝑚, then recurse: Merge1(𝐵𝑖, 𝐴𝑖)
Work ?
Depth ?
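A sequential sketch of Merge1's structure, assuming distinct inputs: rank every block boundary of B in A with a binary search, then merge the aligned block pairs. Here `sorted` stands in for the O(log m) sequential block merge, and the recursion for oversized A-blocks is omitted:

```python
# Merge1 sketch: rank B's block boundaries in A, merge aligned blocks.
# The boundary rankings (step 2) are independent binary searches.
from bisect import bisect_right
from math import log2

def merge1(A, B):
    if not B:
        return list(A)
    k = max(1, int(log2(len(B))))          # block size ~ log m
    cuts = list(range(k, len(B), k))       # block boundaries in B
    R = [bisect_right(A, B[c - 1]) for c in cuts]   # parallel rankings
    out, pa, pb = [], 0, 0
    for c, r in zip(cuts + [len(B)], R + [len(A)]):
        # sorted() stands in for the sequential per-block merge
        out.extend(sorted(A[pa:r] + B[pb:c]))
        pa, pb = r, c
    return out
```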
Sequential Sorting
What is the complexity ?
𝒪(?)
Sequential Sorting
Comparison based
𝒪(𝑛 log 𝑛)
Can we sort faster than 𝒪(𝑛 log 𝑛) ?
Non-comparison based
𝒪(𝑛)
Bucket sort
Assume input is uniformly distributed over an interval [𝑎, 𝑏]
Divide interval into 𝑚 equal sized intervals (buckets)
Drop numbers into appropriate buckets
Sort each bucket (say using quicksort)
𝒪(𝑛 log(𝑛/𝑚))
For 𝑚 = 𝒪(𝑛): 𝒪(𝑛) sorting
Radix sort
dense, uniform distribution
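A sketch of the bucket sort just described (sequential; the per-bucket sorts are independent and parallelize trivially). Assumes inputs lie in [lo, hi):

```python
# Bucket sort: m equal-width buckets over [lo, hi); O(n log(n/m)) total,
# O(n) when m = O(n) and the input is roughly uniform.
def bucket_sort(xs, lo, hi, m):
    buckets = [[] for _ in range(m)]
    for x in xs:                                      # drop into buckets
        i = min(int((x - lo) * m / (hi - lo)), m - 1)
        buckets[i].append(x)
    out = []
    for b in buckets:                                 # independent -> parallel
        out.extend(sorted(b))                         # e.g. quicksort each
    return out
```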
Parallel Quicksort (figure: data distributed over 𝑝1 𝑝2 𝑝3 𝑝4)
parallel median selection
parallel exchange
Sample Sort (figure: data distributed over 𝑝1 𝑝2 𝑝3 𝑝4)
𝑝1: pick splitters and broadcast
bucket data & all-to-all exchange
Sample sort
1. randomly partition input into 𝑛/𝑝 points per processor
2. sort locally
3. select 𝑝 splitters/processor (evenly spaced)
   guarantees no more than 2𝑛/𝑝 elements per bucket (proof below)
4. gather(splitters) in 𝑝0
5. sort splitters in 𝑝0 and create buckets
6. block partition using 𝑝 binary searches on the 𝑛/𝑝-element sorted sequences
7. exchange data (all-to-all)
8. sort locally again
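A single-process simulation of these steps with 𝑝 logical processors. The sampling and splitter-selection indices here are one plausible choice, not exactly the slides' scheme, and it assumes n is large relative to p:

```python
# Sample sort, simulated: local sort, sample, splitters, all-to-all, sort.
from bisect import bisect_right

def sample_sort(xs, p):
    n = len(xs)
    # 1-2. block-partition the input and sort each "processor's" chunk
    chunks = [sorted(xs[i * n // p:(i + 1) * n // p]) for i in range(p)]
    # 3-5. each chunk contributes p-1 evenly spaced samples; "p0" sorts the
    # gathered sample and picks p-1 global splitters from it
    sample = sorted(c[len(c) * (k + 1) // p] for c in chunks for k in range(p - 1))
    splitters = [sample[(k + 1) * len(sample) // p] for k in range(p - 1)]
    # 6-7. route every element to its bucket (the all-to-all exchange)
    buckets = [[] for _ in range(p)]
    for x in xs:
        buckets[bisect_right(splitters, x)].append(x)
    # 8. final local sorts; concatenation is globally sorted
    return [x for b in buckets for x in sorted(b)]
```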
Sample Sort - Complexity
Sort locally: 𝒪((𝑛/𝑝) log(𝑛/𝑝))
Select 𝑝 − 1 splitters per process: 𝒪(𝑝)
Gather splitters in 𝑝0: 𝒪(𝑝²)
Sort splitters in 𝑝0: 𝒪(𝑝² log 𝑝)
Broadcast splitters: 𝒪(𝑝 log 𝑝)
Sort again: 𝒪((𝑛/𝑝) log(𝑛/𝑝))
Sample Sort – load balance
Guarantees no more than 2𝑛/𝑝 elements per bucket
Proof:
All entries on 𝑝𝑖 must be > 𝑠𝑖−1 and ≤ 𝑠𝑖
(𝑖 − 2)𝑝 + 𝑝/2 elements of the sample are ≤ 𝑠𝑖
  lower bound: lb = ((𝑖 − 2)𝑝 + 𝑝/2) · 𝑛/𝑝²
(𝑝 − 𝑖)𝑝 − 𝑝/2 elements of the sample are > 𝑠𝑖
  upper bound: ub = ((𝑝 − 𝑖)𝑝 − 𝑝/2) · 𝑛/𝑝² + 𝑛/𝑝² − 1
Maximum number of elements on processor 𝑖:
𝑛 − ub − lb = 2𝑛/𝑝 − 𝑛/𝑝² + 1 ≤ 2𝑛/𝑝 (for 𝑛 ≥ 𝑝²) ∎