Page 1: Chapter 4, CLR Textbook

Chapter 4, CLR Textbook

Algorithms on Rings of Processors

Page 2: Chapter 4, CLR Textbook

Algorithms on Rings of Processors

• When using message passing, it is common to abstract away from the physical network and to choose a convenient logical network instead.

• This chapter presents several algorithms intended for the logical ring network studied earlier

• Coverage of how logical networks map onto physical networks is deferred to Sections 4.6 and 4.7

• Rings are a linear interconnection network
– Ideal for a first look at distributed memory algorithms
– Each processor has a single predecessor and successor

Page 3: Chapter 4, CLR Textbook

Matrix-Vector Multiplication

• The first unidirectional ring algorithm will be the multiplication y = Ax of an n×n matrix A by a vector x of dimension n.

1. for i = 0 to n-1 do
2.   yi ← 0
3.   for j = 0 to n-1 do
4.     yi ← yi + Ai,j xj

• Each iteration of the outer (i) loop computes the scalar product of one row of A with vector x.

• These scalar products can be performed in any order.

• These scalar products will be distributed among the processors so these can be done in parallel.
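The sequential double loop above can be sketched directly (a minimal illustration, not the textbook's code):

```python
# Sequential y = A x for an n x n matrix, mirroring the pseudocode above:
# each row of A contributes one scalar product.
def mat_vec(A, x):
    n = len(x)
    y = [0.0] * n
    for i in range(n):          # one scalar product per row of A
        for j in range(n):
            y[i] += A[i][j] * x[j]
    return y

A = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]
print(mat_vec(A, x))  # [3.0, 7.0]
```

Because the n scalar products are independent, they can be computed in any order, which is what makes the distribution over processors possible.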

Page 4: Chapter 4, CLR Textbook

Matrix-Vector Multiplication (cont.)

• We assume that n is divisible by p and let r = n/p.
• Each processor must store r contiguous rows of matrix A and compute r scalar products.
– This is called a block row.
• The corresponding r components of the vectors y and x are also stored with each processor.
• Each processor Pq will then store
– Rows qr to (q+1)r - 1 of matrix A (a block of dimension r×n)
– Components qr to (q+1)r - 1 of vectors x and y.
• For simplicity, we ignore the case where n is not divisible by p.
• However, this case can be handled by temporarily adding rows of zeros to matrix A and zeros to vector x so that the resulting number of rows is divisible by p.

Page 5: Chapter 4, CLR Textbook

Matrix-Vector Multiplication (cont.)

• Declarations needed
– var A: array[0..r-1, 0..n-1] of real;
– var x, y: array[0..r-1] of real;
• Then A[0,0] on P0 corresponds to A0,0, but on P1 it corresponds to Ar,0.
– Note the subscripts are global while the array indices are local.
• Also, note that global index (i,j) corresponds to local index (i mod r, j) on processor Pk, where k = ⌊i/r⌋.
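The global-to-local mapping above can be sketched as follows (a hypothetical helper, assuming n divisible by p and r = n/p):

```python
# Map a global matrix entry (i, j) to its owner and local indices under the
# block-row distribution: processor Pk holds global rows k*r .. (k+1)*r - 1.
def owner_and_local(i, j, r):
    k = i // r               # processor rank holding global row i
    return k, i - k * r, j   # local row is i mod r; columns are not distributed

# With n = 8, p = 4, r = 2: global entry (5, 3) lives on P2 as local (1, 3).
print(owner_and_local(5, 3, 2))  # (2, 1, 3)
```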

• The next figure illustrates how the rows and vectors are partitioned among the processors.

Page 6: Chapter 4, CLR Textbook
Page 7: Chapter 4, CLR Textbook

Matrix-Vector Multiplication (cont.)

• The partitioning of the data makes it possible to solve larger problems in parallel.
• The parallel algorithm can solve a problem roughly p times larger than the sequential algorithm.
• Algorithm 4.1 is given on the next slide.
• In each loop iteration of Algorithm 4.1, each processor Pq computes the product of an r×r sub-matrix with a vector of size r.
– This is a partial result.
– The values of the components of y assigned to Pq are obtained by adding all of these partial results together.

Page 8: Chapter 4, CLR Textbook
Page 9: Chapter 4, CLR Textbook

Matrix-Vector Multiplication (cont.)

• In the first pass through the loop, the x-components in Pq are the ones originally assigned to it.
• During each pass through the loop, Pq computes the scalar product of the appropriate part of its block of A with its current components of x.
– Concurrently, each Pq sends its block of x-values to Pq+1 (mod p) and receives a new block of x-values from Pq-1 (mod p).
– At the conclusion, each Pq again holds its original block of x-values and has computed the correct values for its y-components.

• These steps are illustrated in Figure 4.2.
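The p-step shifting scheme can be sketched as a simulation (this models the data movement of Algorithm 4.1 on one machine; it is not distributed code, and the helper name is illustrative):

```python
# Ring matrix-vector product, simulated: p logical processors each own r rows
# of A and an r-entry block of x; at each of p steps every processor multiplies
# against the x-block it currently holds, then the blocks shift around the ring.
def ring_mat_vec(A, x, p):
    n = len(x)
    r = n // p                                    # assume p divides n
    xblocks = [x[q*r:(q+1)*r] for q in range(p)]  # block currently held by Pq
    y = [0.0] * n
    for step in range(p):
        for q in range(p):
            src = (q - step) % p                  # original owner of Pq's block
            for li in range(r):                   # Pq's local rows
                for lj in range(r):
                    y[q*r + li] += A[q*r + li][src*r + lj] * xblocks[q][lj]
        # circular shift: each Pq "sends" its block to Pq+1 (mod p)
        xblocks = [xblocks[(q - 1) % p] for q in range(p)]
    return y

A = [[1, 0, 0, 2], [0, 1, 0, 0], [3, 0, 1, 0], [0, 0, 0, 1]]
x = [1, 2, 3, 4]
print(ring_mat_vec(A, x, 2))  # [9.0, 2.0, 6.0, 4.0], same as the sequential A x
```

Note that after p shifts every block is back with its original owner, matching the last bullet above.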

Page 10: Chapter 4, CLR Textbook
Page 11: Chapter 4, CLR Textbook

Analysis of Matrix-Vector Multiplication

• There are p identical steps.
• Each step involves three activities: compute, send, and receive.
• The times to send and to receive are identical, and the activities are concurrent, so the execution time is

T(p) = p · max{ r²w, L + rb }

where w is the time to multiply a vector component by a matrix component and add the product to a partial result, b is the inverse of the bandwidth, and L is the communication startup cost.

• As r = n/p, the computation cost becomes asymptotically larger than the communication cost as n increases, since (for n large)

p r²w = (n²/p) w ≫ p(L + rb) = pL + nb

Page 12: Chapter 4, CLR Textbook

Matrix-Vector Multiplication Analysis (cont)

• Next, we calculate various metrics and their complexity.
• For large n,
– T(p) = p(r²w) = n²w/p, or O(n²/p); this is O(n²) if p is constant
– The cost = (n²w/p)·p = n²w, or O(n²)
– The speedup = ts/T(p) = cn² · (p/(n²w)) = (c/w)p, or O(p)
• However, if p is constant/small, the speedup is only O(1)
– The efficiency = ts/cost = cn²/(n²w) = c/w, or O(1)
• Note efficiency = ts/(p·tp) = O(1)
• Note that if vector x were duplicated across all processors, then there would be no need for any communication, and parallel efficiency would be O(1) for all values of n.
– However, there would be an increased memory cost

Page 13: Chapter 4, CLR Textbook

Matrix-Matrix Multiplication

• Using matrix-vector multiplication, this is easy.
• Let C = AB, where all are n×n matrices.
• The multiplication consists of computing n² scalar products:

for i = 0 to n-1 do
  for j = 0 to n-1 do
    Ci,j ← 0
    for k = 0 to n-1 do
      Ci,j ← Ci,j + Ai,k Bk,j

• We will distribute the matrices over the p processors, giving the first processor the first r = n/p rows, etc.
• Declaration:

var A, B, C: array[0..r-1, 0..n-1] of real;

Page 14: Chapter 4, CLR Textbook
Page 15: Chapter 4, CLR Textbook

Matrix-Matrix Multiplication & Analysis

• This algorithm is very similar to the one for matrix-vector multiplication
– Scalar products are replaced by sub-matrix multiplications
– Circular shifting of a vector is replaced by circular shifting of matrix rows
• Analysis:
– Each step lasts as long as the longest of the three activities performed during the step: compute, send, and receive.
– T(p) = p · max{ nr²w, L + nrb }
– As before, the asymptotic parallel efficiency is 1 when n is large.
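The "shift block rows of B around the ring" idea can be sketched the same way as the matrix-vector simulation (again a single-machine model of the data movement, not the textbook's Algorithm verbatim):

```python
# Ring matrix-matrix product, simulated: each Pq owns block rows of A and C;
# the block rows of B circulate, and Pq accumulates A[:, block] * B[block, :].
def ring_mat_mat(A, B, p):
    n = len(A)
    r = n // p                                            # assume p divides n
    Bblocks = [[row[:] for row in B[q*r:(q+1)*r]] for q in range(p)]
    C = [[0.0] * n for _ in range(n)]
    for step in range(p):
        for q in range(p):
            src = (q - step) % p          # original owner of Pq's current B block
            for li in range(r):           # rows of Pq's block of C
                i = q*r + li
                for bj in range(n):
                    for lk in range(r):
                        C[i][bj] += A[i][src*r + lk] * Bblocks[q][lk][bj]
        Bblocks = [Bblocks[(q - 1) % p] for q in range(p)]  # circular shift
    return C

print(ring_mat_mat([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2))
# [[19.0, 22.0], [43.0, 50.0]]
```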

Page 16: Chapter 4, CLR Textbook

Matrix-Matrix Multiplication Analysis

• Naïve algorithm: matrix-matrix multiplication could be achieved by executing matrix-vector multiplication n times.
• Analysis of the naïve algorithm:
– Execution time is just the time for matrix-vector multiplication, multiplied by n.
– T'(p) = p · max{ nr²w, nL + nrb }
– The only difference between T and T' is that the term L has become nL.
• The naïve approach exchanges vectors of size r at each step, while the algorithm developed in this section exchanges matrices of size r×n.
– This does not change the asymptotic efficiency.
– However, sending data in bulk can significantly reduce the communication overhead.

Page 17: Chapter 4, CLR Textbook

Stencil Applications

• Popular applications that operate on a discrete domain that consists of cells.
• Each cell holds some value(s) and has neighbor cells.
• The application applies pre-defined rules to update the value(s) of a cell using the values of its neighbor cells.
• The locations of the neighbor cells and the function used to update cell values constitute a stencil that is applied to all cells in the domain.
• These types of applications arise in many areas of science and engineering.
• Examples include image processing, approximate solutions to differential equations, and simulation of complex cellular automata (e.g., Conway's Game of Life).

Page 18: Chapter 4, CLR Textbook

A Simple Sequential Algorithm

• We consider a stencil application on a 2D domain of size n×n.
• Each cell has 8 neighbors, as shown below:

NW  N  NE
W   c  E
SW  S  SE

• The algorithm we consider updates the value of cell c based on the already updated values of its West and North neighbors.
• The stencil is shown on the next slide and can be formalized as

cnew ← UPDATE(cold, Wnew, Nnew)

Page 19: Chapter 4, CLR Textbook
Page 20: Chapter 4, CLR Textbook

A Simple Sequential Algorithm (cont)

• This simple stencil is similar to important applications
– Gauss-Seidel numerical method
– Smith-Waterman biological string comparison algorithm
• This stencil cannot be applied to cells in the top row or left column.
– These cells are handled by the update function.
– To indicate that no neighbor exists for a cell update, we pass a Nil argument to UPDATE.
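A minimal sequential sketch of this sweep, with None playing the role of the Nil argument. The UPDATE rule used here (sum of the available values) is a placeholder assumption; the chapter leaves UPDATE abstract:

```python
# Hypothetical UPDATE rule; the real rule is application-specific.
def update(c_old, w_new, n_new):
    return c_old + (w_new or 0) + (n_new or 0)

# One in-place sweep: cell (i, j) uses the already-updated West and North
# neighbors, and None (Nil) on the top row / left column.
def stencil_sweep(A):
    n = len(A)
    for i in range(n):
        for j in range(n):
            west = A[i][j-1] if j > 0 else None   # Nil for the left column
            north = A[i-1][j] if i > 0 else None  # Nil for the top row
            A[i][j] = update(A[i][j], west, north)
    return A

print(stencil_sweep([[1, 1], [1, 1]]))  # [[1, 2], [2, 5]]
```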

Page 21: Chapter 4, CLR Textbook

Greedy Parallel Algorithm for Stencil

• Consider a ring of p processors, P0, P1, …, Pp-1.
• Must decide how to allocate cells among processors.
– Need to balance the computational load without creating overly expensive communications.
– Assume initially that p is equal to n.
– We will allocate row i of domain A to the ith processor, Pi.
• Declaration needed: var A: array[0..n-1] of real;
– As soon as Pi has computed a cell value, it sends that value to Pi+1 (0 ≤ i < p-1).
– Initially, only A0,0 can be computed.
– Once A0,0 is computed, then A1,0 and A0,1 can be computed.
– The computation proceeds in steps. At step k, all values on the k-th anti-diagonal are computed.

Page 22: Chapter 4, CLR Textbook
Page 23: Chapter 4, CLR Textbook

General Steps of Greedy Algorithm

• At time i+j, processor Pi performs the following operations:
– It receives Ai-1,j from Pi-1
– It computes Ai,j
– Then it sends Ai,j to Pi+1

• Exceptions:

– P0 does not need to receive cell values to update its cells.

– Pp-1 does not send its cell values after updating its cells.

• Above exceptions do not influence algorithm performance.

• This algorithm is captured in Algorithm 4.3 on next slide.
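The schedule described above (cell (i, j) computed at time i + j) can be enumerated anti-diagonal by anti-diagonal; a small sketch:

```python
# With p = n and row i on Pi, cell (i, j) is computed at step t = i + j,
# so each step t computes exactly the cells on anti-diagonal t.
def schedule(n):
    steps = {}
    for t in range(2 * n - 1):                          # 2n - 1 anti-diagonals
        steps[t] = [(i, t - i) for i in range(n) if 0 <= t - i < n]
    return steps

s = schedule(3)
print(s[2])  # [(0, 2), (1, 1), (2, 0)]
```

This makes the pipeline startup visible: the full ring is busy only once t reaches p - 1, which is where the idle-time term in the later analysis comes from.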

Page 24: Chapter 4, CLR Textbook
Page 25: Chapter 4, CLR Textbook

Tracing Steps in Preceding Algorithm

• Re-read pages 72-73 of CLR on send & receive for synchronous rings.
– See slides 35-40, esp. 37-38, in the slides on synchronous networks.
• Steps 1-3 are performed by all processors.
– All processors obtain an array A of n reals, their ID number, and the number of processors.
• Steps 4-6 are performed only by P0.
• In Step 5, P0 updates the cell A0,0 in the NW (top) corner.
• In Step 6, P0 sends the contents of A[0] (cell A0,0) to its successor, P1.
• Steps 7-8 are executed only by P1, since it is the only processor receiving a message at this point. (If every Pi executed the blocking "receive" here, all Pi for i > 1 would block.)
• In Step 8, P1 stores the updated value of A0,0 received from P0 in variable v.
• In Step 9, P1 uses the value in v to update the value in A[0] of cell A1,0.

Page 26: Chapter 4, CLR Textbook

Tracing Steps in Algorithm (cont)

• Steps 12-13 are executed by P0 to update the value A[j] of its next cell A0,j in the top row and send that value to P1.
• Steps 14-16 are executed only by Pn-1 on the bottom row to update the value A[j] of its next cell An-1,j.
– This value will be used by Pn-1 to update its next cell in the next round.
• Pn-1 does not send a value, since its row is the last one.
• Only Pi for 0 < i < n-1 can execute Steps 18-19.
• In Step 18, on the j-th loop iteration, the processors executing Steps 18-19 are further restricted to those receiving a message (a blocking "receive").
– Pi executes the send and the receive in parallel.
• In Step 19, Pi uses the received value to update the value A[j] of its next cell Ai,j.

Page 27: Chapter 4, CLR Textbook

Algorithm for Fewer Processors

• Typically, we have far fewer processors than rows.
• WLOG, assume p divides n.
• If n/p contiguous rows were assigned to each processor, then at least n/p steps must occur before P0 can send a value to P1.
– This situation repeats between each Pi and Pi+1, severely restricting parallelism.
• Instead, we assign rows to processors cyclically, with row j assigned to Pj mod p.
• Each processor has the following declaration:

var A: array[0..n/p - 1, 0..n-1] of real;

• This is a contiguous array of rows in local memory, but the rows are not contiguous in the domain.
• Algorithm 4.4, for the stencil application on a ring of processors using a cyclic data distribution, is given next.
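The cyclic distribution just described is a one-line mapping; a small sketch (helper name is illustrative):

```python
# Cyclic row distribution: row j goes to P(j mod p) and is stored at
# local index j // p in that processor's array.
def cyclic_owner(j, p):
    return j % p, j // p   # (processor rank, local row index)

# With p = 4, rows 0..7 wrap around the ring:
print([cyclic_owner(j, 4) for j in range(8)])
# [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1), (2, 1), (3, 1)]
```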

Page 28: Chapter 4, CLR Textbook
Page 29: Chapter 4, CLR Textbook

Cyclic Stencil Algorithm Execution Time

• Let T(n,p) be the execution time of the preceding algorithm.
• We assume that "receiving" is blocking while "sending" is not.
• The sending of a message in step k is followed by the reception of the message in step k+1.
• The time needed to perform one algorithm step is w + b + L, where
– w is the time needed to update a cell
– b is the time to communicate a cell value
– L is the startup cost.
• The computation terminates when Pp-1 finishes computing the rightmost cell value of its last row of cells.

Page 30: Chapter 4, CLR Textbook

Cyclic Stencil Algorithm Run Time (cont)

• The number of algorithm steps is p - 1 + n²/p
– Pp-1 is idle for the first p-1 steps.
– Once Pp-1 starts computing, it computes a cell at each step until the computation is completed.
– There are n² cells, split evenly among the processors, so each processor is assigned n²/p cells.
– This yields

T(n,p) = (p - 1 + n²/p)(w + b + L)

• Additional problem:
– The algorithm was designed to minimize the time between a cell update computation and its reception by the next processor.
– However, the algorithm performs many communications of small data items.
– L can be orders of magnitude larger than b if the cell values are small.

Page 31: Chapter 4, CLR Textbook

Cyclic Stencil Algorithm Run Time (cont)

• Stencil application characteristics:
– The cell value is often as small as a single integer or real number.
– The computation to update a cell may involve only a few operations, so w may also be small.
– For many computations, most of the execution time could be due to the L term in the equation for T(n,p).
– Spending a large amount of time in communication overhead reduces the parallel efficiency considerably.
• Note that Ep(n) = Tseq(n) / (p·Tpar(n)) = n²w / (p·Tpar(n))
• Ep(n) reduces to the formula below. Note that even as n increases, the efficiency may remain well below 1:

E(n) = n²w / [p(p - 1 + n²/p)(w + b + L)] ≈ w / (w + b + L)   (for large n)

Page 32: Chapter 4, CLR Textbook

Augmenting Granularity of Algorithm

• The communication overhead due to startup latencies can be decreased by sending fewer, larger messages.
– Let each processor compute k contiguous cell values in each row during each step, instead of just 1 value.
– To simplify the analysis, we assume k divides n, so each row has n/k segments of k contiguous cells.
• If k does not divide n, the last incomplete segment can spill over to the next row; the last segment of the last row may then have fewer than k elements.
– With this algorithm, cell values are communicated in bulk, k at a time.

Page 33: Chapter 4, CLR Textbook

Augmenting Granularity of Algorithm (cont)

• Effect of communicating k items in bulk on the algorithm:
– Larger values of k produce less communication overhead.
– However, larger values of k increase the time between a cell value's update and its reception by the next processor.
– With this algorithm, processors start computing cell values later, leading to more processor idle time.
• This approach is illustrated in the next diagram.

Page 34: Chapter 4, CLR Textbook
Page 35: Chapter 4, CLR Textbook

Block-Cyclic Allocation of Cells

• A second way to reduce communication costs is to decrease the number of cell values that are communicated.
• This is done by allocating blocks of r consecutive rows to processors cyclically.
• To simplify the analysis, we assume rp divides n.
• This idea of a block-cyclic allocation is very useful and is illustrated below:
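The block-cyclic mapping can be sketched the same way as the cyclic one (helper name illustrative; r rows per block, dealt out cyclically to p processors):

```python
# Block-cyclic distribution: rows are grouped into blocks of r consecutive
# rows; block b goes to P(b mod p), at local block index b // p.
def block_cyclic_owner(row, r, p):
    block = row // r
    return block % p, (block // p) * r + row % r   # (rank, local row index)

# r = 2, p = 2: rows 0-1 -> P0, rows 2-3 -> P1, rows 4-5 -> P0, ...
print([block_cyclic_owner(j, 2, 2) for j in range(8)])
# [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (0, 3), (1, 2), (1, 3)]
```

With r = 1 this degenerates to the cyclic distribution; with p = 1 it degenerates to a plain block distribution.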

Page 36: Chapter 4, CLR Textbook

Block-Cyclic Allocation of Cells (cont)

• Each processor computes k contiguous cells in each row of a block of r rows.
• At each step, each processor now computes rk cells.
– Note blocks are r×k (r rows, k columns) in size.
• Note: only those values on the edges of a block have to be sent to other processors.
– This general approach can dramatically decrease the number of cells whose updates have to be sent to other processors.
• The algorithm for this allocation is similar to the one shown for the cyclic row assignment scheme in Figure 4.6:
– Simply replace "rows" by "blocks of rows".
– A processor computes all cell values in its first block of rows in n/k steps of the algorithm.

Page 37: Chapter 4, CLR Textbook

Block-Cyclic Allocations (cont)

• Processor Pp-1 sends its first k cell values to P0 after p algorithm steps.
– P0 needs these values to compute its second "block of rows".
– As a result, we need n ≥ kp in order to keep the processors busy.
– If n > kp, then processors must temporarily store received cell values while they finish computing their block of rows for the previous step.
• Recall that processors only have to exchange data at the boundaries between blocks.
• Using r rows per block, the amount of data communicated is r times smaller than in the previous algorithm.

Page 38: Chapter 4, CLR Textbook

Block-Cyclic Allocations (cont)

• Processor activities in computing a block:
– Receive k cell values from its predecessor
– Compute kr cell values
– Send k cell values to its successor
• Again, we assume "receives" are blocking while "sends" are not.
• The time required to perform one step of the algorithm is

krw + kb + L

• The computation finishes when processor Pp-1 finishes computing the rightmost segment of its last block of rows.
• Pp-1 computes one segment of a block row at each step.

Page 39: Chapter 4, CLR Textbook

Optimizing Block-Cyclic Allocations

• There are n²/(kr) such segments, so the p processors can compute them in n²/(pkr) steps.
• It takes p-1 algorithm steps before processor Pp-1 can start doing any computation.
• Afterwards, Pp-1 computes one segment at each step.
• Overall, the algorithm runs for p - 1 + n²/(pkr) steps, with a total execution time of

T(n,p,r,k) = (p - 1 + n²/(pkr))(krw + kb + L)

• The efficiency of this algorithm is

E = tseq / (p·tpar) = n²w / (p·T(n,p,r,k))

Page 40: Chapter 4, CLR Textbook

Optimizing Block-Cyclic Allocations (cont)

• This gives an asymptotic efficiency (for large n) of

E ≈ krw / (krw + kb + L) = 1 / (1 + b/(rw) + L/(krw))

• Note that by increasing r and k, it is possible to achieve significantly higher efficiency.
• However, increasing r and k also lengthens the p-1 start-up steps, so processors begin computing later.
• The text also outlines how to determine optimal values for k and r, using a fair amount of mathematics.

Page 41: Chapter 4, CLR Textbook

Implementing Logical Topologies

• Designers of parallel algorithms must choose a logical topology.
• In Section 4.5, switching the topology from a unidirectional ring to a bidirectional ring made the program much simpler and lowered the communication time.
• Message-passing libraries, such as implementations of MPI, allow communication between any two processors using the Send and Recv functions.
• Using a logical topology restricts communication to only a few paths, which usually makes the algorithm design simpler.
• The logical topology can be implemented by creating a set of functions that allows each processor to identify its neighbors.
– A unidirectional ring only needs NextNode(P)
– A bidirectional ring would also need PreviousNode(P)
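The two neighbor functions named above can be sketched for processors identified by ranks 0..p-1 (the function names come from the slide; the modular arithmetic is an assumption):

```python
# Ring neighbor functions for a logical ring of p processors.
def NextNode(rank, p):
    return (rank + 1) % p      # successor on the unidirectional ring

def PreviousNode(rank, p):
    return (rank - 1) % p      # predecessor, needed for a bidirectional ring

print(NextNode(3, 4), PreviousNode(0, 4))  # 0 3
```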

Page 42: Chapter 4, CLR Textbook

Logical Topologies (cont)

• Some systems (e.g., many modern supercomputers) provide several physical networks, but the creation of logical topologies is sometimes left to the user.
• A difficult task is matching the application's logical topology to the physical topology.
• The common wisdom is that a logical topology that resembles the physical topology should produce good performance.
• Sometimes the reason for using a logical topology is to hide the complexity of the physical topology.
• Often extensive benchmarking is required to determine the best topology for a given algorithm on a given platform.
• The logical topologies studied in this chapter and the next are known to be useful in the majority of scenarios.

Page 43: Chapter 4, CLR Textbook

Distributed vs Centralized Implementations

• In the CLR text, the data is already distributed among the processors at the start of the execution.
• One may wonder how the data was distributed to the processors, and whether that distribution should also be part of the algorithm.
• There are two approaches: distributed and centralized.
• In the centralized approach, one assumes that the data resides in a single "master" location:
– A single processor
– A file on disk, if the data size is large.
• The CLR book takes the distributed approach. The Akl book usually takes the distributed approach as well, but occasionally takes the centralized approach.

Page 44: Chapter 4, CLR Textbook

Distributed vs Centralized (cont)

• An advantage of the centralized approach is that the library routine can choose the data distribution scheme to enforce.
• The best performance requires that the choice for each algorithm consider the underlying topology.
– This cannot be done in advance.
• Often the library developer will provide multiple versions with different data distributions.
– The user can then choose the version that best fits the underlying platform.
– This choice may be difficult without extensive benchmarking.
• The main disadvantage of the centralized approach arises when the user applies successive algorithms to the same data:
– The data will be repeatedly distributed and undistributed.
– This causes most library developers to opt for the distributed option.

Page 45: Chapter 4, CLR Textbook

Summary of Algorithmic Principles (For Asynchronous Message Passing)

Although illustrated here only for the ring topology, the principles below are general. Unfortunately, they often conflict with each other.

• Sending data in bulk
– Reduces the communication overhead due to network latencies
• Sending data early
– Sending data as early as possible allows other processors to start computing as early as possible.

Page 46: Chapter 4, CLR Textbook

Summary of Algorithmic Principles (For Asynchronous Message Passing) -- Continued --

• Overlapping communication and computation
– If both can be performed at the same time, the communication cost is often hidden
• Block data distribution
– Assigning processors blocks of contiguous data elements reduces the amount of communication
• Cyclic data distribution
– Interleaving data elements among processors makes it possible to reduce idle time and achieve a better load balance