
Page 1: Multithreaded Algorithms

Jie Wang

University of Massachusetts Lowell, Department of Computer Science

Some of the slides presented are borrowed from University of Washington Professor Marty Stepp's slides.

Page 2: Parallel Computing

Sequential algorithms are algorithms for a computing device with a single processor and random-access memory. These algorithms are also referred to as single-threaded algorithms.

Multithreaded algorithms are algorithms for a computing device with multiple processors (CPUs, cores) and shared memory.

Parallel computing: multiple processors simultaneously solve a problem. For example, split a computing task into several parts and assign a processor to solve each part at the same time.

Page 3: Concurrent Computing

Concurrent computing: multiple execution flows (e.g., threads) access a shared resource at the same time. For example, concurrent computing allows the same data structure to be updated by multiple threads.

Page 4: Computing with Multiple Cores

Run multiple programs at the same time. For example: Core 1 runs Chrome; Core 2 runs iTunes; Core 3 runs Eclipse. The OS (Windows, OS X, Linux) gives programs "time slices" of attention from the cores.

Do multiple things in the same program with multiple threads.

We need to rethink everything about algorithms: parallel code is harder to write, especially in Java, C, C++, and other common languages.

Each thread is given its own memory for its call stack and local variables.

Global objects are shared between multiple threads.

Separate processes do not share memory with each other.

Page 5: Dynamic Multithreading

Dynamic multithreading lets the programmer specify parallelism without worrying about communication protocols, load balancing, and the other vagaries of static-thread programming.

The concurrency platform contains a scheduler for load balancing.

Two features: (1) nested parallelism; (2) parallel loops. Keywords: parallel, spawn, sync.

Example: Fibonacci numbers (inefficient):

    Fib(n)
      if n ≤ 1
          return n
      else x = Fib(n − 1)
           y = Fib(n − 2)
           return x + y

Page 6: Multithreading Fibonacci

Fib(n − 1) and Fib(n − 2) can be executed independently, and so can be run in parallel.

    P-Fib(n)
      if n ≤ 1
          return n
      else x = spawn P-Fib(n − 1)
           y = P-Fib(n − 2)
           sync
           return x + y

Nested parallelism occurs when the keyword spawn precedes a procedure call: the parent may continue to execute in parallel with the spawned child.

The keyword sync makes a procedure wait for all of its spawned children to complete.

Page 7: Multithreading and Computation DAG

Multithreaded computation can be modeled as a DAG G = (V, E).

Nodes are instructions. Edges are dependencies.

A strand represents a chain of instructions without parallel keywords (i.e., no spawn, sync, or return from a spawn).

Two strands S_1 and S_2 are logically in series if there is a directed path from S_1 to S_2. Otherwise, they are logically in parallel.

A multithreaded computation is a DAG of strands embedded in a tree of procedure instances; it starts with a single strand and ends with a single strand.

Page 8: Multithreaded Fibonacci Computation

Page 9: Multithreaded Fibonacci Computation Continued

Page 10: Complexity Measures

Assume ideal parallelism: sequentially consistent shared memory among multiple processors.

The memory behaves as if each processor's instructions were executed sequentially according to some global linear order.

Work: the total time to execute the entire computation on one processor.

The work equals the sum of the times taken by all strands. It is simply the number of nodes in the computation DAG if each strand takes unit time.

Span: the longest time to execute the strands along any path in the DAG.

The span is the number of nodes on a longest path (a.k.a. critical path) if each strand takes unit time. A critical path in a DAG can be found in O(|V| + |E|) time.

Page 11: Work Law and Span Law

Let T_P denote the runtime of a multithreaded algorithm on P processors.

Let T_1 denote the work to be done, which is the runtime on a single processor.

Let T_∞ denote the span.

Work Law: T_P ≥ T_1/P.

Span Law: T_P ≥ T_∞.

Speedup: T_1/T_P ≤ P.

Linear speedup: T_1/T_P = Θ(P). Perfect speedup: T_1/T_P = P.

Parallelism: T_1/T_∞ is the maximum speedup that can be achieved even if the number of processors is unbounded.

Beyond the parallelism, the more processors are used, the less perfect the speedup: if P ≫ T_1/T_∞, then T_1/T_P ≪ P.

Page 12: The Speedup of Multithreaded Fibonacci Cannot Be Much Larger than 2

Page 13: Parallel Slackness

Slackness measure: T_1/(P · T_∞).

If T_1/(P · T_∞) < 1, then perfect speedup cannot be achieved.

This is because T_1/T_P ≤ T_1/T_∞ < P, while perfect speedup requires T_1/T_P = P.

As the slackness decreases from 1 toward 0, the speedup diverges increasingly further from perfect.

Page 14: Greedy Scheduling

In addition to minimizing the work and span, strands must also be scheduled efficiently onto the processors.

A multithreaded scheduler schedules the computation without advance knowledge of when strands will be spawned or synced.

An online distributed scheduler is ideal but hard to analyze. An online centralized scheduler is not ideal but is easy to analyze; greedy schedulers are an example.

A greedy scheduler assigns as many strands to processors as possible in each step.

Complete step: at least P strands are ready to execute in the step. Incomplete step: fewer than P strands are ready to execute in the step.

The work law implies that the best possible runtime is T_P = T_1/P.

The span law implies that the best possible runtime is T_P = T_∞.

Any greedy scheduler achieves the upper bound T_P ≤ T_1/P + T_∞, which we prove next.

Page 15: Greedy Scheduler Upper Bound

Theorem 27.1. A greedy scheduler on an ideal parallel computer executes a multithreaded computation in time T_P ≤ T_1/P + T_∞.

Proof. Let S_C and S_I denote the number of complete steps and incomplete steps, respectively. Then T_P ≤ S_C + S_I.

We first bound S_C, the number of complete steps. If S_C ≥ ⌊T_1/P⌋ + 1, then the total work done in complete steps alone would be at least

    P(⌊T_1/P⌋ + 1) = P⌊T_1/P⌋ + P
                   = T_1 − (T_1 mod P) + P
                   > T_1.

This is impossible. Thus, S_C ≤ ⌊T_1/P⌋.

Page 16: Greedy Scheduler Upper Bound Continued

Let G be the computation DAG and, w.l.o.g., assume that each strand runs in unit time.

Let G′ be the subgraph yet to be executed at the start of an incomplete step.

Let G″ be the subgraph remaining to be executed after the incomplete step.

Let L be the length of a longest path in G′. An incomplete step executes every ready strand, which includes the first strand of every longest path in G′; hence the length of a longest path in G″ must be L − 1. Since L ≤ T_∞ and each incomplete step shortens the longest remaining path by one, we have S_I ≤ T_∞. This completes the proof.

Page 17: Approximation Ratio

Corollary 27.2. The runtime T_P of a greedy scheduler is within a factor of 2 of the runtime T*_P of an optimal scheduler.

Proof. The work law and the span law imply that T*_P ≥ max{T_1/P, T_∞}. From Theorem 27.1,

    T_P ≤ T_1/P + T_∞
        ≤ 2 max{T_1/P, T_∞}
        ≤ 2 T*_P.

Corollary 27.3. If P ≪ T_1/T_∞, then T_1/T_P ≈ P.

Proof. Since P ≪ T_1/T_∞, we have T_∞ ≪ T_1/P. Thus, T_P ≤ T_1/P + T_∞ ≈ T_1/P. On the other hand, T_P ≥ T_1/P by the work law. Thus, T_1/T_P ≈ P.

Page 18: Analysis of Parallelism

To analyze the parallelism of the multithreaded Fibonacci P-Fib(n):

    T_∞(n) = max{T_∞(n − 1), T_∞(n − 2)} + Θ(1)
           = T_∞(n − 1) + Θ(1)
           = Θ(n).

Since T_1(n) = Θ(Φⁿ), where Φ = (1 + √5)/2 is the golden ratio, the parallelism is

    T_1(n)/T_∞(n) = Θ(Φⁿ/n).

Page 19: Parallel Loops

Consider matrix-vector multiplication, where A = (a_ij) is an n × n matrix, x = (x_j) is an n-vector, and y = (y_i) is the resulting n-vector with

    y_i = Σ_{j=1}^{n} a_ij x_j.
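The slide's pseudocode does not survive in the transcript; the standard CLRS procedure for this computation, whose lines 5–7 contain the parallel for loop referenced on the next page, is:

    MAT-VEC(A, x)
    1  n = A.rows
    2  let y be a new vector of length n
    3  parallel for i = 1 to n
    4      y_i = 0
    5  parallel for i = 1 to n
    6      for j = 1 to n
    7          y_i = y_i + a_ij · x_j
    8  return y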

Page 20: Parallel Loops Continued

A compiler implements each parallel for loop as a divide-and-conquer subroutine using nested parallelism.

For example, the compiler implements the parallel for loop in lines 5–7 with the call MAT-VEC-MAIN-LOOP(A, x, y, n, 1, n), where the subroutine is as follows.
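(Reproduced from CLRS, since the slide's code is not in the transcript. The procedure recursively splits the iteration range [i, i′], spawning the first half so the two halves run in parallel, until a single iteration remains.)

    MAT-VEC-MAIN-LOOP(A, x, y, n, i, i′)
      if i == i′
          for j = 1 to n
              y_i = y_i + a_ij · x_j
      else mid = ⌊(i + i′)/2⌋
          spawn MAT-VEC-MAIN-LOOP(A, x, y, n, i, mid)
          MAT-VEC-MAIN-LOOP(A, x, y, n, mid + 1, i′)
          sync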

Page 21: DAG for MAT-VEC-MAIN-LOOP(A, x, y, 8, 1, 8)

Page 22: Parallelism

It is straightforward that T_1(n) = Θ(n²).

For the span, the recursive spawning of the parallel loop contributes Θ(log n) depth, after which iteration i runs its inner serial loop in time iter_∞(i) = Θ(n):

    T_∞(n) = Θ(log n) + max_{1≤i≤n} iter_∞(i) = Θ(n).

Thus, the parallelism is T_1(n)/T_∞(n) = Θ(n²/n) = Θ(n).

Page 23: Race Conditions
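The slide's content is not in the transcript. In CLRS terms, a determinacy race occurs when two logically parallel instructions access the same memory location and at least one of them performs a write. The standard CLRS example:

    RACE-EXAMPLE()
      x = 0
      parallel for i = 1 to 2
          x = x + 1
      print x

The two loop iterations are logically in parallel and both update x, so the program may print 2 or, if the iterations' loads and stores interleave badly, 1.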

Page 24: Illustration of the Determinacy Race
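The figure itself is not in the transcript. Each update x = x + 1 executes as a load, an increment, and a store of x; an interleaving that loses an update looks like this:

    Iteration 1            Iteration 2
    r1 = x    (reads 0)
                           r2 = x    (reads 0)
    r1 = r1 + 1
                           r2 = r2 + 1
    x = r1    (x is 1)
                           x = r2    (x is 1, not 2)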

Page 25: A Chess Lesson

This is a true story (with simplified timings).

A multithreaded chess-playing program was built with T_1 = 2048 and T_∞ = 1.

It was prototyped on a 32-processor computer, giving (by the greedy bound T_P ≈ T_1/P + T_∞) T_32 = 2048/32 + 1 = 65, and was to be deployed on a 512-processor machine.

Later an optimization was found with T′_1 = 1024 and T′_∞ = 8. Thus, T′_32 = 1024/32 + 8 = 40.

On the 512-processor machine, the original program gives T_512 = 2048/512 + 1 = 5, while the optimized program gives T′_512 = 1024/512 + 8 = 10.

Thus, although the optimized program runs faster on the 32-processor machine, it is slower than the original program on the 512-processor machine.

Page 26: Thread and Runnable
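The slide's Java code is not in the transcript; a minimal sketch of the standard pattern (class and message names are illustrative): a task implements Runnable, and a Thread object runs it.

    public class ThreadDemo {
        // A task to run on its own thread implements Runnable.
        static class MyTask implements Runnable {
            public void run() {
                System.out.println("Hello from " + Thread.currentThread().getName());
            }
        }

        public static void main(String[] args) {
            Thread t = new Thread(new MyTask());  // wrap the task in a Thread
            t.start();  // run() now executes concurrently with main
        }
    }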

Page 27: Waiting for a Thread
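Again the slide's code is not in the transcript; the standard way to wait for a thread in Java is join(), sketched below.

    public class JoinDemo {
        public static void main(String[] args) throws InterruptedException {
            Thread t = new Thread(() -> System.out.println("worker finished"));
            t.start();
            t.join();  // block until t completes (may throw InterruptedException)
            System.out.println("main resumes only after t is done");
        }
    }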

Page 28: Multithreaded Example: Parallel Summation

Page 29: Parallelizing

Page 30: Initial Steps

Page 31: Runnable Partial Sum
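A sketch of what a runnable partial-sum class looks like (the name PartialSum is illustrative, not necessarily the slide's): each worker sums its own subrange of the array and stores the result for the caller to read after join().

    public class PartialSum implements Runnable {
        private final int[] a;
        private final int lo, hi;  // half-open range [lo, hi)
        private long sum = 0;

        public PartialSum(int[] a, int lo, int hi) {
            this.a = a;
            this.lo = lo;
            this.hi = hi;
        }

        public void run() {
            for (int i = lo; i < hi; i++) {
                sum += a[i];  // each thread touches only its own range: no race
            }
        }

        public long getSum() { return sum; }  // read only after the thread is joined
    }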

Page 32: Sum Method with Threads
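A sketch of the two-thread sum method, assuming the PartialSum class above: split the array at the midpoint, run a thread per half, join both, and combine.

    public static long sum(int[] a) throws InterruptedException {
        int mid = a.length / 2;
        PartialSum left  = new PartialSum(a, 0, mid);
        PartialSum right = new PartialSum(a, mid, a.length);

        Thread t1 = new Thread(left);
        Thread t2 = new Thread(right);
        t1.start();
        t2.start();
        t1.join();  // wait for both halves before reading the partial sums
        t2.join();

        return left.getSum() + right.getSum();
    }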

Page 33: Three or More Threads
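Generalizing to any number of threads (an illustrative sketch): give each of the numThreads workers a slice of roughly equal length, start them all, then join and accumulate.

    public static long sum(int[] a, int numThreads) throws InterruptedException {
        PartialSum[] tasks = new PartialSum[numThreads];
        Thread[] threads = new Thread[numThreads];
        int len = (a.length + numThreads - 1) / numThreads;  // ceiling of a.length / numThreads

        for (int i = 0; i < numThreads; i++) {
            int lo = Math.min(i * len, a.length);
            int hi = Math.min(lo + len, a.length);
            tasks[i] = new PartialSum(a, lo, hi);
            threads[i] = new Thread(tasks[i]);
            threads[i].start();
        }

        long total = 0;
        for (int i = 0; i < numThreads; i++) {
            threads[i].join();           // wait for worker i, then collect its result
            total += tasks[i].getSum();
        }
        return total;
    }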

Page 34: Multithreaded Example: Parallel MergeSort

Page 35: Runnable MergeSort
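A sketch of a runnable sorter (the class name Sorter is illustrative): wrapping a sequential sort of a subrange in a Runnable lets a thread sort that subrange independently.

    public class Sorter implements Runnable {
        private final int[] a;
        private final int lo, hi;  // sorts the half-open range a[lo..hi)

        public Sorter(int[] a, int lo, int hi) {
            this.a = a;
            this.lo = lo;
            this.hi = hi;
        }

        public void run() {
            // Stand-in for a sequential mergeSort(a, lo, hi).
            java.util.Arrays.sort(a, lo, hi);
        }
    }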

Page 36: MergeSort with Multiple Threads
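A two-thread mergesort sketch using the Sorter class above: sort the two halves concurrently, join, then do one sequential merge.

    public static void parallelMergeSort(int[] a) throws InterruptedException {
        int mid = a.length / 2;
        Thread t1 = new Thread(new Sorter(a, 0, mid));
        Thread t2 = new Thread(new Sorter(a, mid, a.length));
        t1.start();
        t2.start();
        t1.join();  // both halves must be fully sorted before merging
        t2.join();
        merge(a, mid);
    }

    // Standard sequential merge of the sorted halves a[0..mid) and a[mid..n).
    private static void merge(int[] a, int mid) {
        int[] tmp = new int[a.length];
        int i = 0, j = mid, k = 0;
        while (i < mid && j < a.length)
            tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
        while (i < mid) tmp[k++] = a[i++];
        while (j < a.length) tmp[k++] = a[j++];
        System.arraycopy(tmp, 0, a, 0, a.length);
    }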

Page 37: Three or More Threads

Page 38: Map/Reduce (Google's Invention)
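The slide content is not in the transcript. In outline, MapReduce runs a map function over the input shards in parallel to emit key-value pairs, groups the pairs by key, and runs a reduce function over each group. A single-machine Java sketch of the same shape (word count) using parallel streams:

    import java.util.Arrays;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class WordCount {
        public static void main(String[] args) {
            String[] docs = { "a rose is a rose", "a thread is a thread" };

            Map<String, Long> counts = Arrays.stream(docs)
                .parallel()                                     // map phase runs in parallel
                .flatMap(doc -> Arrays.stream(doc.split(" ")))  // map: document -> words
                .collect(Collectors.groupingBy(word -> word,    // shuffle: group by key
                         Collectors.counting()));               // reduce: count each group

            System.out.println(counts);  // e.g. a=4, is=2, rose=2, thread=2 (order may vary)
        }
    }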
