A.Broumandnia, Broumandnia@gmail.com
5 PRAM and Basic Algorithms
Topics in This Chapter
5.1 PRAM Submodels and Assumptions
5.2 Data Broadcasting
5.3 Semigroup or Fan-in Computation
5.4 Parallel Prefix Computation
5.5 Ranking the Elements of a Linked List
5.6 Matrix Multiplication
5.1 PRAM Submodels and Assumptions
Fig. 4.6 Conceptual view of a parallel random-access machine (PRAM): processors 0 through p – 1 connected to a shared memory with locations 0 through m – 1.
Processor i can do the following in three phases of one cycle:
1. Fetch a value from address si in shared memory
2. Perform computations on data held in local registers
3. Store a value into address di in shared memory
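The three-phase cycle above can be sketched in Python (an illustrative simulation, not part of the original slides; the processor records and their `compute` functions are hypothetical):

```python
# Minimal sketch of one synchronous PRAM cycle: every processor fetches,
# then computes, then stores, with each phase completing before the next.

def pram_cycle(shared, procs):
    """Run one three-phase PRAM cycle for all processors in lockstep.

    Each processor is a dict with a source address 'si', a destination
    address 'di', a local register 'reg', and a 'compute' function.
    """
    # Phase 1: all processors fetch from shared memory.
    for p in procs:
        p["reg"] = shared[p["si"]]
    # Phase 2: all processors compute on their local registers.
    for p in procs:
        p["reg"] = p["compute"](p["reg"])
    # Phase 3: all processors store into shared memory.
    for p in procs:
        shared[p["di"]] = p["reg"]
    return shared

# Example: two processors doubling values into fresh locations.
mem = [3, 5, 0, 0]
procs = [
    {"si": 0, "di": 2, "reg": 0, "compute": lambda x: 2 * x},
    {"si": 1, "di": 3, "reg": 0, "compute": lambda x: 2 * x},
]
pram_cycle(mem, procs)  # mem becomes [3, 5, 6, 10]
```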
5.1 PRAM Submodels and Assumptions
• Because the addresses si and di are determined by Processor i, independently of all other processors, it is possible that several processors may want to read data from the same memory location or write their values into a common location. Hence, four submodels of the PRAM model have been defined based on whether concurrent reads (writes) from (to) the same location are allowed. The four possible combinations, depicted in Fig. 5.1, are:
EREW: Exclusive-read, exclusive-write
ERCW: Exclusive-read, concurrent-write
CREW: Concurrent-read, exclusive-write
CRCW: Concurrent-read, concurrent-write
5.1 PRAM Submodels and Assumptions
Fig. 5.1 Submodels of the PRAM model, classified by whether reads from, and writes to, the same location are exclusive or concurrent:
EREW: Least “powerful,” most “realistic”
CREW: Default
ERCW: Not useful
CRCW: Most “powerful,” further subdivided
Types of CRCW PRAM
• CRCW PRAM is further classified according to how concurrent writes are handled. Here are a few example submodels based on the semantics of concurrent writes in CRCW PRAM:
Undefined: In case of multiple writes, the value written is undefined (CRCW-U).
Detecting: A special code representing “detected collision” is written (CRCW-D).
Common: Multiple writes are allowed only if all store the same value (CRCW-C). This is sometimes called the consistent-write submodel.
Random: The value written is randomly chosen from among those offered (CRCW-R). This is sometimes called the arbitrary-write submodel.
Priority: The processor with the lowest index succeeds in writing its value (CRCW-P).
Max/Min: The largest/smallest of the multiple values is written (CRCW-M).
Reduction: The arithmetic sum (CRCW-S), logical AND (CRCW-A), logical XOR (CRCW-X), or some other combination of the multiple values is written.
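A few of these write-conflict semantics can be sketched as follows (an illustrative helper, not from the slides; the submodel names match the text, but the function itself is hypothetical):

```python
# Sketch of CRCW write-conflict resolution for one shared-memory cell.

def resolve_writes(offers, submodel):
    """Resolve concurrent writes to a single location.

    offers: list of (processor_index, value) pairs targeting the cell.
    """
    values = [v for _, v in offers]
    if submodel == "common":      # CRCW-C: all written values must agree
        assert len(set(values)) == 1
        return values[0]
    if submodel == "priority":    # CRCW-P: lowest-index processor wins
        return min(offers)[1]
    if submodel == "max":         # CRCW-M: largest value is written
        return max(values)
    if submodel == "sum":         # CRCW-S: reduction by arithmetic sum
        return sum(values)
    raise ValueError(submodel)

offers = [(2, 7), (0, 4), (5, 9)]
resolve_writes(offers, "priority")  # 4 (Processor 0 has the lowest index)
resolve_writes(offers, "max")       # 9
resolve_writes(offers, "sum")       # 20
```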
Power of PRAM Submodels
• These submodels are all different from each other and from EREW and CREW.
• One way to order these submodels is by their computational power.
• The following relationships have been established between some of the PRAM submodels:
EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P
Some Elementary PRAM Computations
Initializing an n-vector (base address = B) to all 0s:
for j = 0 to ⌈n/p⌉ – 1 Processor i, 0 ≤ i < p, do
  if jp + i < n then M[B + jp + i] := 0
endfor
Adding two n-vectors and storing the results in a third (base addresses B′, B″, B‴)
Convolution of two n-vectors: Wk = Σ_{i+j=k} Ui Vj (base addresses BW, BU, BV)
(Data layout: each n-vector is viewed as n/p segments of p elements each.)
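The segmented layout above, in which Processor i handles element jp + i of segment j, can be sketched as follows (a sequential simulation, not from the slides; the inner loop over i stands in for the p processors acting in parallel):

```python
# Sketch of the n/p-segment data layout for elementary PRAM computations.

def init_vector(M, B, n, p):
    """Set M[B .. B+n-1] to 0 the way p PRAM processors would."""
    for j in range((n + p - 1) // p):       # ceil(n/p) segments
        for i in range(p):                  # all p processors "in parallel"
            if j * p + i < n:               # guard for the last segment
                M[B + j * p + i] = 0
    return M

def add_vectors(M, Bx, By, Bz, n, p):
    """M[Bz+k] := M[Bx+k] + M[By+k], element k handled as above."""
    for j in range((n + p - 1) // p):
        for i in range(p):
            k = j * p + i
            if k < n:
                M[Bz + k] = M[Bx + k] + M[By + k]
    return M

mem = [1, 2, 3, 10, 20, 30, 0, 0, 0]
add_vectors(mem, 0, 3, 6, 3, 2)   # mem[6:9] becomes [11, 22, 33]
```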
5.2 Data Broadcasting
• Simple, or one-to-all, broadcasting is used when one processor needs to send a data value to all other processors.
• In the CREW or CRCW submodels, broadcasting is trivial: the sending processor writes the data value into a memory location, and all processors read that data value in the following machine cycle.
• Thus, simple broadcasting is done in Θ(1) steps.
• Multicasting within groups is equally simple if each processor knows its group membership(s) and only members of each group read the multicast data for that group.
• All-to-all broadcasting, where each of the p processors needs to send a data value to all other processors, can be done through p separate broadcast operations in Θ(p) steps, which is optimal.
• The above scheme is clearly inapplicable to broadcasting in the EREW model.
5.2 Data Broadcasting
• The simplest scheme for EREW broadcasting is to make p copies of the data value, say in a broadcast vector B of length p, and then let each processor read its own copy by accessing B[j].
• Initially, Processor i writes its data value into B[0].
• Next, a method known as recursive doubling is used to copy B[0] into all elements of B in ⌈log₂ p⌉ steps.
• Finally, Processor j, 0 ≤ j < p, reads B[j] to get the data value broadcast by Processor i.
5.2 Data Broadcasting
Fig. 5.2 Data broadcasting in EREW PRAM via recursive doubling.

Making p copies of B[0] by recursive doubling:
for k = 0 to ⌈log₂ p⌉ – 1 Proc j, 0 ≤ j < p, do
  Copy B[j] into B[j + 2^k]
endfor
5.2 Data Broadcasting
EREW PRAM algorithm for broadcasting by Processor i:
Processor i write the data value into B[0]
s := 1
while s < p
  Processor j, 0 ≤ j < min(s, p – s), do
    Copy B[j] into B[j + s]
  s := 2s
endwhile
Processor j, 0 ≤ j < p, read the data value in B[j]

Fig. 5.3 EREW PRAM data broadcasting without redundant copying.
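The Fig. 5.3 algorithm transcribes directly into Python (a sequential sketch; the inner loop over j stands in for the processors acting in parallel, and every read and write address in a step is distinct, so all accesses are exclusive):

```python
# Transcription of the EREW broadcast without redundant copying.

def erew_broadcast(value, p):
    """Broadcast one value to p processors using only exclusive accesses."""
    B = [None] * p
    B[0] = value                      # the broadcasting processor writes first
    s = 1
    while s < p:
        # Processors j, 0 <= j < min(s, p - s), copy "in parallel";
        # sources B[0..s-1] and targets B[s..] never overlap in a step.
        for j in range(min(s, p - s)):
            B[j + s] = B[j]
        s *= 2
    return B                          # each Processor j now reads B[j]

erew_broadcast(42, 6)   # [42, 42, 42, 42, 42, 42]
```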
All-to-All Broadcasting on EREW PRAM
• To perform all-to-all broadcasting, so that each processor broadcasts a value that it holds to each of the other p – 1 processors, we let Processor j write its value into B[j], rather than into B[0].
• Thus, in one memory access step, all of the values to be broadcast are written into the broadcast vector B.
• Each processor then reads the other p – 1 values in p – 1 memory accesses.
• To ensure that all reads are exclusive, Processor j begins reading the values starting with B[j + 1], wrapping around to B[0] after reading B[p – 1].
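The wrap-around read schedule above can be sketched as follows (an illustrative simulation, not from the slides; in step k, Processor j reads B[(j + k) mod p], so no two processors ever touch the same cell in the same step):

```python
# Sketch of all-to-all broadcasting with skewed, exclusive reads.

def all_to_all(values):
    p = len(values)
    B = list(values)                  # one write step: Processor j fills B[j]
    received = [[] for _ in range(p)]
    for k in range(1, p):             # p - 1 exclusive read steps
        for j in range(p):            # all processors act "in parallel"
            received[j].append(B[(j + k) % p])
    return received

all_to_all([10, 20, 30])
# Processor 0 receives [20, 30], Processor 1 receives [30, 10],
# Processor 2 receives [10, 20].
```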
All-to-All Broadcasting on EREW PRAM
EREW PRAM algorithm for all-to-all broadcasting:
Processor j, 0 ≤ j < p, write own data value into B[j]
for k = 1 to p – 1 Processor j, 0 ≤ j < p, do
  Read the data value in B[(j + k) mod p]
endfor

This O(p)-step algorithm is time-optimal.
All-to-All Broadcasting on EREW PRAM
Naive EREW PRAM sorting algorithm (using all-to-all broadcasting):
Processor j, 0 ≤ j < p, write 0 into R[j]
for k = 1 to p – 1 Processor j, 0 ≤ j < p, do
  l := (j + k) mod p
  if S[l] < S[j] or (S[l] = S[j] and l < j)
  then R[j] := R[j] + 1
  endif
endfor
Processor j, 0 ≤ j < p, write S[j] into S[R[j]]

This O(p)-step sorting algorithm is far from optimal; sorting is possible in O(log p) time.

Because each data element must be given a unique rank, ties are broken by using the processor ID. In other words, if Processors i and j (i < j) hold equal data values, the value in Processor i is deemed smaller for ranking purposes. This sorting algorithm is not optimal in that the O(p²) computational work involved in it is significantly greater than the O(p log p) work required for sorting p elements on a single processor.
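The ranking idea can be rendered in Python as follows (a sequential sketch of the naive algorithm; ties are broken by index, as in the text, so every element receives a unique rank):

```python
# Sketch of sorting by rank computation via all-to-all broadcasting.

def rank_sort(S):
    p = len(S)
    R = [0] * p
    for k in range(1, p):
        for j in range(p):                    # all processors "in parallel"
            l = (j + k) % p
            if S[l] < S[j] or (S[l] == S[j] and l < j):
                R[j] += 1                     # one more element ranks below S[j]
    out = [None] * p
    for j in range(p):
        out[R[j]] = S[j]                      # S[j] goes to position R[j]
    return out

rank_sort([3, 1, 3, 2])   # [1, 2, 3, 3]
```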
5.3 Semigroup or Fan-in Computation
• This computation is trivial for a CRCW PRAM of the “reduction” variety if the reduction operator happens to be ⊗.
• For example, computing the arithmetic sum (logical AND, logical XOR) of p values, one per processor, is trivial for the CRCW-S (CRCW-A, CRCW-X) PRAM; it can be done in a single cycle by each processor writing its corresponding value into a common location that will then hold the arithmetic sum (logical AND, logical XOR) of all of the values.
5.3 Semigroup or Fan-in Computation
EREW PRAM semigroup computation algorithm:
Proc j, 0 ≤ j < p, copy X[j] into S[j]
s := 1
while s < p
  Proc j, 0 ≤ j < p – s, do
    S[j + s] := S[j] ⊗ S[j + s]
  s := 2s
endwhile
Broadcast S[p – 1] to all processors

Fig. 5.4 Semigroup computation in EREW PRAM.
0 1 2 3 4 5 6 7 8 9
S 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7 8:8 9:9
0:0 0:1 1:2 2:3 3:4 4:5 5:6 6:7 7:8 8:9
0:0 0:1 0:2 0:3 1:4 2:5 3:6 4:7 5:8 6:9
0:0 0:1 0:2 0:3 0:4 0:5 0:6 0:7 1:8 2:9
0:0 0:1 0:2 0:3 0:4 0:5 0:6 0:7 0:8 0:9
This algorithm is optimal for PRAM, but its speedup of O(p / log p) is not
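The Fig. 5.4 algorithm can be transcribed as follows (a sketch with ⊗ taken to be addition; the snapshot of S simulates the simultaneous reads of one parallel step):

```python
# Transcription of the EREW semigroup computation via recursive doubling.

def semigroup_sum(X):
    p = len(X)
    S = list(X)                       # Proc j copies X[j] into S[j]
    s = 1
    while s < p:
        old = S[:]                    # simulate simultaneous reads of the step
        for j in range(p - s):        # Procs j, 0 <= j < p - s, "in parallel"
            S[j + s] = old[j] + old[j + s]
        s *= 2
    return S                          # S[p-1] would then be broadcast

semigroup_sum([1, 2, 3, 4])   # [1, 3, 6, 10]; S[3] = 10 is the full sum
```

Note that, as the trace above shows, S[j] ends up holding the prefix 0:j for every j, a fact exploited in Section 5.4.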
5.3 Semigroup or Fan-in Computation
• When each of the p processors is in charge of n/p elements, rather than just one element, the semigroup computation is performed by each processor:
1. First combining its n/p elements in n/p steps to get a single value.
2. Then, the algorithm just discussed is used, with the first step replaced by copying the result of the above into S[j].
• It is instructive to evaluate the speedup and efficiency of the above algorithm for an n-input semigroup computation using p processors (n = p and n > p) (page 97).
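The two-phase scheme can be sketched as follows (an illustrative simulation assuming, for simplicity, that p divides n and that ⊗ is addition; the plain `sum` in phase 2 stands in for the O(log p)-step parallel algorithm above):

```python
# Sketch of the semigroup computation for n inputs on p processors.

def semigroup_n_over_p(X, p):
    n = len(X)
    block = n // p                       # assumption: p divides n
    # Phase 1: each processor reduces its own n/p-element block
    # sequentially; all p blocks are processed "in parallel".
    partial = [sum(X[j * block:(j + 1) * block]) for j in range(p)]
    # Phase 2: combine the p partial results (the Fig. 5.4 algorithm,
    # shown here as a plain reduction for brevity).
    return sum(partial)

semigroup_n_over_p(list(range(8)), 4)   # 28
```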
5.4 Parallel Prefix Computation
• We see in Fig. 5.4 that as we find the semigroup computation result in S[p – 1], all partial prefixes are also obtained in the previous elements of S. Figure 5.6 is identical to Fig. 5.4, except that it includes shading to show that the number of correct prefix results doubles in each step.
Fig. 5.6 Parallel prefix computation in EREW PRAM via recursive doubling.
0 1 2 3 4 5 6 7 8 9
S 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7 8:8 9:9
0:0 0:1 1:2 2:3 3:4 4:5 5:6 6:7 7:8 8:9
0:0 0:1 0:2 0:3 1:4 2:5 3:6 4:7 5:8 6:9
0:0 0:1 0:2 0:3 0:4 0:5 0:6 0:7 1:8 2:9
0:0 0:1 0:2 0:3 0:4 0:5 0:6 0:7 0:8 0:9
Same as the first part of semigroup computation (no final broadcasting)
5.4 Parallel Prefix Computation
• The previous algorithm is quite efficient, but there are other ways of performing parallel prefix computation on the PRAM. In particular, the divide-and-conquer paradigm leads to two other solutions to this problem.
• In the following, we deal only with the case of a p-input problem, where p (the number of inputs or processors) is a power of 2.
• As in Section 5.3, the pair of integers u:v represents the combination (e.g., sum) of all input values from xu to xv.
• Figure 5.7 depicts our first divide-and-conquer algorithm.
• We view the problem as composed of two subproblems: computing the odd-indexed results s1, s3, s5, . . . and computing the even-indexed results s0, s2, s4, . . . .
First Divide-and-Conquer Parallel-Prefix Algorithm
Fig. 5.7 Parallel prefix computation using a divide-and-conquer scheme. (Each vertical line represents a location in shared memory. Adjacent inputs x0, x1, x2, x3, . . . , xn–2, xn–1 are combined pairwise, a parallel prefix computation of size n/2 yields the odd-indexed results 0:1, 0:3, . . . , 0:n–1, and one final step produces the even-indexed results 0:0, 0:2, . . . , 0:n–2.)

In hardware, this is the basis for the Brent–Kung carry-lookahead adder.

T(p) = T(p/2) + 2  ⇒  T(p) ≅ 2 log₂ p
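Under the assumptions that ⊗ is addition and the input size is a power of 2 (as in the text), the Fig. 5.7 scheme can be sketched recursively:

```python
# Sketch of the first divide-and-conquer parallel prefix scheme:
# pair adjacent inputs (one step), recurse on the n/2 pair-sums to get
# the odd-indexed prefixes, then one more step gives the even-indexed
# ones -- hence the recurrence T(p) = T(p/2) + 2.

def prefix_dc1(x):
    n = len(x)
    if n == 1:
        return list(x)
    pairs = [x[2*i] + x[2*i + 1] for i in range(n // 2)]   # one step
    odd = prefix_dc1(pairs)           # odd-indexed results s1, s3, s5, ...
    s = [0] * n
    for i in range(n // 2):
        s[2*i + 1] = odd[i]
    s[0] = x[0]
    for i in range(1, n // 2):        # one more step: s_{2i} = s_{2i-1} + x_{2i}
        s[2*i] = s[2*i - 1] + x[2*i]
    return s

prefix_dc1([1, 2, 3, 4])   # [1, 3, 6, 10]
```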
Second Divide-and-Conquer Parallel-Prefix Algorithm
• Figure 5.8 depicts a second divide-and-conquer algorithm.
• We view the input list as composed of two sublists: the even-indexed inputs x0, x2, x4, . . . and the odd-indexed inputs x1, x3, x5, . . . . Parallel prefix computation is performed separately on each sublist, leading to partial results as shown in Fig. 5.8 (a sequence of digits indicates the combination of elements with those indices).
• The final results are obtained by pairwise combination of adjacent partial results in a single PRAM step.
• The total computation time is given by the recurrence T(p) = T(p/2) + 1.
• Even though this latter algorithm is more efficient than the first divide-and-conquer scheme, it is applicable only if the operator ⊗ is commutative (why?).
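This second scheme can be sketched as follows (again assuming ⊗ is addition, which is commutative as the text requires, and a power-of-2 input size): prefix each sublist, then combine adjacent partial results in a single step.

```python
# Sketch of the second divide-and-conquer parallel prefix scheme:
# recurse on the even- and odd-indexed sublists, then one combining
# step -- hence T(p) = T(p/2) + 1. Commutativity is needed to reorder
# the even- and odd-indexed elements in each combined prefix.

def prefix_dc2(x):
    n = len(x)
    if n == 1:
        return list(x)
    even = prefix_dc2(x[0::2])        # prefixes of x0, x2, x4, ...
    odd = prefix_dc2(x[1::2])         # prefixes of x1, x3, x5, ...
    s = [0] * n
    s[0] = even[0]
    for k in range(n // 2):
        s[2*k + 1] = even[k] + odd[k]          # x0 + ... + x_{2k+1}
        if 2*k + 2 < n:
            s[2*k + 2] = even[k + 1] + odd[k]  # x0 + ... + x_{2k+2}
    return s

prefix_dc2([1, 2, 3, 4])   # [1, 3, 6, 10]
```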
Second Divide-and-Conquer Parallel-Prefix Algorithm
Fig. 5.8 Another divide-and-conquer scheme for parallel prefix computation. (Each vertical line represents a location in shared memory. Parallel prefix computation is performed separately on the n/2 even-indexed inputs and the n/2 odd-indexed inputs; adjacent partial results are then combined pairwise to yield 0:0, 0:1, . . . , 0:n–2, 0:n–1.)

Strictly optimal algorithm, but requires commutativity.

T(p) = T(p/2) + 1  ⇒  T(p) = log₂ p