Deterministic Parallel Algorithms and Programming
Guy Blelloch, Carnegie Mellon University
EC2, 2011
Parallelism vs. Concurrency

                sequential                  concurrent
  serial        Traditional programming     Traditional OS
  parallel      Deterministic parallelism   General parallelism

- Parallelism: using multiple processors/cores running at the same time. A property of the machine.
- Concurrency: non-determinism due to interleaving threads. Needed for some "interactive" applications.
Deterministic parallel algorithms/programs have great properties, but how practical are they?

Outline:
1. What are the nice properties
2. How general are they
3. Recent results on performance of deterministic algorithms
Concurrency: Stack Example 1

  struct link {int v; link* next;};

  struct stack {
    link* headPtr;
    void push(link* a) {
      a->next = headPtr;
      headPtr = a;
    }
    link* pop() {
      link* h = headPtr;
      if (headPtr != NULL) headPtr = headPtr->next;
      return h;
    }
  };
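The race in Example 1 can be made concrete by replaying one bad interleaving of two concurrent push() calls by hand. A minimal sequential sketch (the helper name `replayLostUpdate` is ours, not from the talk), reusing the link struct from the slide:

```cpp
#include <cassert>
#include <cstddef>

struct link { int v; link* next; };

// Replays one bad interleaving of two concurrent push() calls on an
// empty stack; returns the resulting head. The node b is lost.
link* replayLostUpdate(link* a, link* b) {
    link* headPtr = NULL;
    a->next = headPtr;   // thread 1 starts push(a): reads head, then stalls
    b->next = headPtr;   // thread 2 runs push(b) to completion...
    headPtr = b;         // ...stack is now B
    headPtr = a;         // thread 1 resumes: overwrites head, B is lost
    return headPtr;
}
```

The final stack contains only A; B has silently disappeared, which is exactly the lost-update race the slide's H/A/B diagrams illustrate.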
Concurrency: Stack Example 2

  struct stack {
    link* headPtr;
    void push(link* a) {
      link* h;
      do {
        h = headPtr;
        a->next = h;
      } while (!CAS(&headPtr, h, a));
    }
    link* pop() {
      link* h;
      link* nxt;
      do {
        h = headPtr;
        if (h == NULL) return NULL;
        nxt = h->next;
      } while (!CAS(&headPtr, h, nxt));
      return h;
    }
  };
Concurrency: Stack Example 2'

  P1: x = s.pop(); y = s.pop(); s.push(x);
  P2: z = s.pop();

The ABA problem:

  Before: A B C
  After:  B C

  Interleaving: P2: h = headPtr; P2: nxt = h->next; P1: everything; P2: CAS(&headPtr, h, nxt)

Can be fixed with a counter and 2CAS, but…
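The ABA interleaving above can be replayed sequentially. A sketch (helper names are ours; CAS is modeled as a plain compare-and-set, since the talk treats it abstractly):

```cpp
#include <cassert>
#include <cstddef>

struct link { int v; link* next; };

// Sequential stand-in for the hardware CAS used on the slide.
bool CAS(link** loc, link* expected, link* desired) {
    if (*loc != expected) return false;
    *loc = desired;
    return true;
}

// Replays the ABA interleaving on the stack A B C; returns the final
// head pointer, which ends up as the stale node B.
link* replayABA(link* a, link* b, link* c) {
    a->next = b; b->next = c; c->next = NULL;
    link* headPtr = a;                           // before: A B C
    link* h = headPtr;                           // P2: h = headPtr (A)
    link* nxt = h->next;                         // P2: nxt = h->next (B)
    link* x = headPtr; headPtr = headPtr->next;  // P1: x = pop()  -> A
    headPtr = headPtr->next;                     // P1: y = pop()  -> B
    x->next = headPtr; headPtr = x;              // P1: push(x), stack: A C
    CAS(&headPtr, h, nxt);                       // P2: CAS sees A, wrongly succeeds
    return headPtr;                              // after: B C, but B was popped
}
```

The CAS succeeds because the head is A again, even though the stack changed underneath P2; the popped node B becomes the head.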
Concurrency: Stack Example 3

  struct link {int v; link* next;};

  struct stack {
    link* headPtr;
    void push(link* a) {
      atomic {
        a->next = headPtr;
        headPtr = a;
      }
    }
    link* pop() {
      atomic {
        link* h = headPtr;
        if (headPtr != NULL) headPtr = headPtr->next;
        return h;
      }
    }
  };
Concurrency: Stack Example 3'

  void swapTop(stack s) {
    link* x = s.pop();
    link* y = s.pop();
    s.push(x);
    s.push(y);
  }

Queues are trickier than stacks.
Race Free ≠ Deterministic
Transactions ≠ Deterministic
Linearizability ≠ Deterministic
Parallel Example: Quicksort

  function quicksort(S) =
    if (#S <= 1) then S
    else let
      a  = S[rand(#S)];
      S1 = {e in S | e < a};
      S2 = {e in S | e = a};
      S3 = {e in S | e > a};
      R  = {quicksort(v) : v in [S1, S3]};
    in R[0] ++ S2 ++ R[1];

{ … } means available parallelism, dynamically scheduled (e.g. work stealing). Reasoning about correctness is no different from what we already teach.
Advantages of Determinism

1. Easier to understand and reason about code
2. Composable
3. Easier to write assertions
4. Easier to formally verify (avoids exponential search across interleavings)
5. Easier to debug (avoids Heisenbugs)
6. Easier to understand performance
7. Easier for hardware/compiler to optimize
How to Get Deterministic Parallelism

What is the most general model? Strictly synchronous (data-parallel)? Purely functional?

It is undecidable to determine whether a program returns the same result independent of interleaving.
How to Get Deterministic Parallelism

Undecidable in general. But here is the most general definition I know of:
1. Arbitrary spawns by a thread of any number of child threads.
2. Termination of threads at any point.
3. Synchronization among threads only through condition variables.
4. All concurrent operations on state commute.
Condition Variables

  cond* x = new cond;
    Create a condition variable (initially "clear").
  x->signal();
    Set the signal; all waiting threads can proceed. Must be called at most once. Needs to be verified.
  x->wait();
    Wait until the signal is set.

These define a dependence graph on program instructions, with cross edges from the signal on x to each of the waits on x.
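As a sketch of these semantics (our own model, not code from the talk), a single-shot condition variable can be captured with an atomic flag; for brevity wait() spins rather than blocking:

```cpp
#include <atomic>
#include <cassert>

// Minimal model of the single-shot condition variable described above.
struct cond {
    std::atomic<bool> set;
    cond() : set(false) {}
    void signal() {
        bool already = set.exchange(true);  // all waiters may now proceed
        assert(!already);                   // at most one signal: the
    }                                       // property that must be verified
    void wait() const {
        while (!set.load()) { /* spin; a real version would block */ }
    }
};
```

The single-signal rule is what makes the induced dependence graph, and hence the computation, deterministic: each wait has exactly one possible source.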
Nested Parallelism

Condition variables are hard to work with, so we consider nested parallel computations: arbitrary nesting of fork-join and parallel loops.

This has some important advantages:
- Good for caching
- Reduces scheduling overhead
- Supported by many languages
- Easy to analyze costs
- Makes it easier to verify code(?)
Commuting Operations (roughly)

Operations E1 and E2 are concurrent if there is no path in the dependence graph between them.

Let E(M) → M' be an operation that transforms the state from M to M'.

E1 and E2 commute with respect to state M if E1(E2(M)) = E2(E1(M)).

Any concurrent operations must commute with respect to all states. Needs to be verified.
Commuting Operations (roughly)

Examples:
- read (or any non-modifying query on a data structure)
- writeMin, writeAndAdd
- insert into an ordered dictionary
- delete from a dictionary (but insert does not commute with delete)
- uniqueLabel
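The insert/delete claim is easy to check on a tiny state. A sketch (`ordersAgree` is our helper, with std::set standing in for the dictionary):

```cpp
#include <cassert>
#include <set>

// Applies two operations to copies of the same state in both orders
// and reports whether the final states agree.
bool ordersAgree(std::set<int> s,
                 void (*op1)(std::set<int>&),
                 void (*op2)(std::set<int>&)) {
    std::set<int> a = s, b = s;
    op1(a); op2(a);
    op2(b); op1(b);
    return a == b;
}

void insert5(std::set<int>& s) { s.insert(5); }
void insert7(std::set<int>& s) { s.insert(7); }
void erase5(std::set<int>& s)  { s.erase(5); }
```

Two inserts commute, but insert(5) and delete(5) applied to a set containing 5 give different final states depending on order, so those two operations may not run concurrently under the definition above.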
Our Experiments

We coded up 16 benchmark problems using nested parallelism and commutative operations (deterministic), trying to answer the question: can one get good efficiency with deterministic parallelism?
Preliminary Benchmarks I

Generic:
- Comparison Sorting
- Removing Duplicates
- Dictionary

Graphs:
- Breadth First Search
- Graph Separators
- Minimum Spanning Tree
- Maximal Independent Set

Geometry/Graphics:
- Delaunay Triangulation and Refinement
- Convex Hulls
- Ray Casting
Preliminary Benchmarks II

Machine Learning:
- All Nearest Neighbors
- Support Vector Machines *
- K-Means *

Text Processing:
- Suffix Arrays
- Edit Distance
- String Search

Science:
- Nbody
- Phylogenetic tree *

Numerical:
- Sparse Matrix Vector Multiply
- Sparse Linear Solve *
Preliminary Numbers

Sort Performance, More Detail

  Input             weight   STL Sort   Sanders Sort   Quicksort   SampleSort   SampleSort
  Cores                      1          32             32          32           1
  Uniform           .1       15.8       1.06           4.22        .82          20.2
  Exponential       .1       10.8       .79            2.49        .53          13.8
  Almost Sorted     .1       3.28       1.11           1.76        .27          5.67
  Trigram Strings   .2       58.2       4.63           8.6         1.05         30.8
  Strings Permuted  .2       82.5       7.08           28.4        1.76         49.3
  Structure         .3       17.6       2.03           6.73        1.18         26.7
  Average                    36.4       3.24           10.3        1.08         28.0

All inputs are 100,000,000 long. All code written and run in Cilk++ (also tested in Cilk+). All experiments on a 32-core Nehalem (4 × X7560).
Main Techniques

Mostly:
- Functional programming style: no concurrent updates

But also:
- Commutative updates with history-independent data structures
- Priority ordering on elements, i.e. the result is the same as if they were added sequentially. Makes heavy use of the "write-min" atomic operation.
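The write-min operation is named but not spelled out in the talk; one plausible sketch builds it from a CAS retry loop on std::atomic:

```cpp
#include <atomic>
#include <cassert>

// write-min: store v into *loc only if it is smaller than the value
// already there. Retries until v is installed or loses to a smaller value.
void writeMin(std::atomic<int>* loc, int v) {
    int cur = loc->load();
    while (v < cur && !loc->compare_exchange_strong(cur, v)) {
        // on failure, cur has been reloaded with the current value
    }
}
```

Because the final value is the minimum of everything written, concurrent writeMins to the same location commute: the result is independent of order, which is what makes it usable in a deterministic program.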
Easy Cases

Just functional programming style:
- Sample sort
- Nearest Neighbors
- Nbody
- Convex Hull
- Sparse MV Product
- Suffix Arrays
- Ray Casting
3 Examples of Other Cases

- Dictionary
- Breadth First Search
- Delaunay Triangulation and Refinement
Dictionary

Using hashing:
- Based on generic hash and comparison.
- Problem: the representation can depend on the ordering, and also on which redundant element is kept.
- Solution: use a history-independent hash table based on linear probing; the representation is independent of the order of insertion.
- Use write-min on collision.
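A sketch of the history-independence idea (our own simplification: integer keys, "smaller key wins a slot" as the fixed priority; the real table also pairs this with write-min for concurrent insertion):

```cpp
#include <cassert>
#include <vector>

// History-independent open addressing: along a probe sequence the
// smaller key always claims the slot and the displaced key keeps
// probing, so the final layout does not depend on insertion order.
struct HIHash {
    std::vector<int> slots;                     // -1 marks an empty slot
    explicit HIHash(int n) : slots(n, -1) {}
    int hash(int k) const { return k % (int)slots.size(); }

    void insert(int k) {
        int i = hash(k);
        while (true) {
            int cur = slots[i];
            if (cur == k) return;               // duplicate: keep one copy
            if (cur == -1) { slots[i] = k; return; }
            if (k < cur) {                      // k wins: displace cur
                slots[i] = k;                   // (write-min on collision)
                k = cur;                        // keep probing for the loser
            }
            i = (i + 1) % (int)slots.size();
        }
    }
};
```

Inserting the same keys in any order yields the same array, so a parallel build that races on slots (resolved by write-min) still produces one canonical representation.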
Breadth First Search (BFS)

Goal: generate the same BFS (spanning) tree as the sequential queue-based algorithm.
Breadth First Search (BFS)

Sequential algorithm: [figure]

Another possible tree: [figure]
Breadth First Search (BFS)

Solution:
- Maintain the frontier in priority order.
- Use writeMin to choose the winner.
Delaunay Triangulation/Refinement

- The incremental algorithm adds one point at a time, but points can be added in parallel if they don't interact.
- The problem is that the output will depend on the order in which they are added.
Delaunay Triangulation/Refinement

- Adding points deterministically [figure]
Conclusions

It seems possible to achieve efficient deterministic parallelism for a variety of algorithms.

Remaining issues: verifying the single-signal property of condition variables, and verifying that operations commute.
Parallelism Example: Convex Hull

  function hsplit(points,p1,p2) =
    let
      d  = {distance(p,(p1,p2)) : p in points};
      p' = {p in points; d | plusp(d)};
    in
      if (#p' < 2) then [p1] ++ p'
      else let
        pm = points[max_index(d)];
      in
        flatten({hsplit(p',p1,p2) : p1 in [p1,pm]; p2 in [pm,p2]})

  function convex_hull(points) =
    let
      x    = {x : (x,y) in points};
      minx = points[min_index(x)];
      maxx = points[max_index(x)];
    in
      hsplit(points,minx,maxx) ++ hsplit(points,maxx,minx);
Commuting Operations (more formally)

More formally (Guy Steele, 1990): if

  E1(M)   → Ma'  ⇒ V1a        E2(Ma') → Ma'' ⇒ V2a
  E2(M)   → Mb'  ⇒ V2b        E1(Mb') → Mb'' ⇒ V1b

then E1 and E2 commute with respect to M iff:

  Ma'' = Mb'',  V1a = V1b,  and  V2a = V2b

where the V are the values returned by the operations.
Nested Parallelism: parallel loops

Cilk:
  cilk_for (i=0; i < n; i++)
    B[i] = A[i]+1;

Microsoft TPL (C#, F#):
  Parallel.ForEach(A, x => x+1);

Nesl, Parallel Haskell:
  B = {x + 1 : x in A}

OpenMP:
  #pragma omp for
  for (i=0; i < n; i++)
    B[i] = A[i]+1;
Nested Parallelism: fork-join

Dates back to the 70s (possibly the 60s); used in dialects of Pascal:
  cobegin {
    S1;
    S2;
  }

Java fork-join framework:
  coinvoke(f1,f2)

Microsoft TPL (C#, F#):
  Parallel.invoke(f1,f2)

Cilk+:
  spawn S1;
  S2;
  sync;

Various functional languages:
  (exp1 || exp2)
Serial-Parallel DAGs

Dependence graphs of nested parallel computations are series-parallel.