A.Broumandnia, Broumandnia@gmail.com
5 PRAM and Basic Algorithms
Topics in This Chapter
5.1 PRAM Submodels and Assumptions
5.2 Data Broadcasting
5.3 Semigroup or Fan-in Computation
5.4 Parallel Prefix Computation
5.5 Ranking the Elements of a Linked List
5.6 Matrix Multiplication
5.1 PRAM Submodels and Assumptions
Fig. 4.6 Conceptual view of a parallel random-access machine (PRAM): processors 0 through p – 1 connected to a shared memory with locations 0 through m – 1.
Processor i can do the following in three phases of one cycle:
1. Fetch a value from address si in shared memory
2. Perform computations on data held in local registers
3. Store a value into address di in shared memory
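The three-phase cycle above can be sketched in Python (an illustrative simulation, not part of the original slides; the processor records and their `compute` functions are hypothetical):

```python
# Minimal sketch of one synchronous PRAM cycle: every processor fetches,
# then computes, then stores, with each phase completing before the next.

def pram_cycle(shared, procs):
    """Run one three-phase PRAM cycle for all processors in lockstep.

    Each processor is a dict with a source address 'si', a destination
    address 'di', a local register 'reg', and a 'compute' function.
    """
    # Phase 1: all processors fetch from shared memory.
    for p in procs:
        p["reg"] = shared[p["si"]]
    # Phase 2: all processors compute on their local registers.
    for p in procs:
        p["reg"] = p["compute"](p["reg"])
    # Phase 3: all processors store into shared memory.
    for p in procs:
        shared[p["di"]] = p["reg"]
    return shared

# Example: two processors doubling values into fresh locations.
mem = [3, 5, 0, 0]
procs = [
    {"si": 0, "di": 2, "reg": 0, "compute": lambda x: 2 * x},
    {"si": 1, "di": 3, "reg": 0, "compute": lambda x: 2 * x},
]
pram_cycle(mem, procs)  # mem becomes [3, 5, 6, 10]
```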
5.1 PRAM Submodels and Assumptions
• Because the addresses si and di are determined by Processor i, independently of all other processors, it is possible that several processors may want to read data from the same memory location or write their values into a common location. Hence, four submodels of the PRAM model have been defined based on whether concurrent reads (writes) from (to) the same location are allowed. The four possible combinations, depicted in Fig. 5.1, are:
EREW: Exclusive-read, exclusive-write
ERCW: Exclusive-read, concurrent-write
CREW: Concurrent-read, exclusive-write
CRCW: Concurrent-read, concurrent-write
5.1 PRAM Submodels and Assumptions
Fig. 5.1 Submodels of the PRAM model, classified by whether reads from, and writes to, the same location are exclusive or concurrent:
EREW: Least “powerful,” most “realistic”
CREW: Default
ERCW: Not useful
CRCW: Most “powerful,” further subdivided
Types of CRCW PRAM
• CRCW PRAM is further classified according to how concurrent writes are handled. Here are a few example submodels based on the semantics of concurrent writes in CRCW PRAM:
Undefined: In case of multiple writes, the value written is undefined (CRCW-U).
Detecting: A special code representing “detected collision” is written (CRCW-D).
Common: Multiple writes are allowed only if all store the same value (CRCW-C). This is sometimes called the consistent-write submodel.
Random: The value written is randomly chosen from among those offered (CRCW-R). This is sometimes called the arbitrary-write submodel.
Priority: The processor with the lowest index succeeds in writing its value (CRCW-P).
Max/Min: The largest/smallest of the multiple values is written (CRCW-M).
Reduction: The arithmetic sum (CRCW-S), logical AND (CRCW-A), logical XOR (CRCW-X), or some other combination of the multiple values is written.
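A few of these write-conflict semantics can be sketched as follows (an illustrative helper, not from the slides; the submodel names match the text, but the function itself is hypothetical):

```python
# Sketch of CRCW write-conflict resolution for one shared-memory cell.

def resolve_writes(offers, submodel):
    """Resolve concurrent writes to a single location.

    offers: list of (processor_index, value) pairs targeting the cell.
    """
    values = [v for _, v in offers]
    if submodel == "common":      # CRCW-C: all written values must agree
        assert len(set(values)) == 1
        return values[0]
    if submodel == "priority":    # CRCW-P: lowest-index processor wins
        return min(offers)[1]
    if submodel == "max":         # CRCW-M: largest value is written
        return max(values)
    if submodel == "sum":         # CRCW-S: reduction by arithmetic sum
        return sum(values)
    raise ValueError(submodel)

offers = [(2, 7), (0, 4), (5, 9)]
resolve_writes(offers, "priority")  # 4 (Processor 0 has the lowest index)
resolve_writes(offers, "max")       # 9
resolve_writes(offers, "sum")       # 20
```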
Power of PRAM Submodels
• These submodels are all different from each other and from EREW and CREW.
• One way to order these submodels is by their computational power.
• The following relationships have been established between some of the PRAM submodels:
EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P
Some Elementary PRAM Computations
Initializing an n-vector (base address = B) to all 0s:
for j = 0 to ⌈n/p⌉ – 1 Processor i, 0 ≤ i < p, do
  if jp + i < n then M[B + jp + i] := 0
endfor
Adding two n-vectors and storing the results in a third (base addresses B′, B″, B‴)
Convolution of two n-vectors: Wk = Σ_{i+j=k} Ui Vj (base addresses BW, BU, BV)
(Data layout: each n-vector is viewed as n/p segments of p elements each.)
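The segmented layout above, in which Processor i handles element jp + i of segment j, can be sketched as follows (a sequential simulation, not from the slides; the inner loop over i stands in for the p processors acting in parallel):

```python
# Sketch of the n/p-segment data layout for elementary PRAM computations.

def init_vector(M, B, n, p):
    """Set M[B .. B+n-1] to 0 the way p PRAM processors would."""
    for j in range((n + p - 1) // p):       # ceil(n/p) segments
        for i in range(p):                  # all p processors "in parallel"
            if j * p + i < n:               # guard for the last segment
                M[B + j * p + i] = 0
    return M

def add_vectors(M, Bx, By, Bz, n, p):
    """M[Bz+k] := M[Bx+k] + M[By+k], element k handled as above."""
    for j in range((n + p - 1) // p):
        for i in range(p):
            k = j * p + i
            if k < n:
                M[Bz + k] = M[Bx + k] + M[By + k]
    return M

mem = [1, 2, 3, 10, 20, 30, 0, 0, 0]
add_vectors(mem, 0, 3, 6, 3, 2)   # mem[6:9] becomes [11, 22, 33]
```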
5.2 Data Broadcasting
• Simple, or one-to-all, broadcasting is used when one processor needs to send a data value to all other processors.
• In the CREW or CRCW submodels, broadcasting is trivial: the sending processor writes the data value into a memory location, and all processors read that data value in the following machine cycle.
• Thus, simple broadcasting is done in Θ(1) steps.
• Multicasting within groups is equally simple if each processor knows its group membership(s) and only members of each group read the multicast data for that group.
• All-to-all broadcasting, where each of the p processors needs to send a data value to all other processors, can be done through p separate broadcast operations in Θ(p) steps, which is optimal.
• The above scheme is clearly inapplicable to broadcasting in the EREW model.
5.2 Data Broadcasting
• The simplest scheme for EREW broadcasting is to make p copies of the data value, say in a broadcast vector B of length p, and then let each processor read its own copy by accessing B[j].
• Initially, Processor i writes its data value into B[0].
• Next, a method known as recursive doubling is used to copy B[0] into all elements of B in ⌈log₂ p⌉ steps.
• Finally, Processor j, 0 ≤ j < p, reads B[j] to get the data value broadcast by Processor i.
5.2 Data Broadcasting
Fig. 5.2 Data broadcasting in EREW PRAM via recursive doubling.

Making p copies of B[0] by recursive doubling:
for k = 0 to ⌈log₂ p⌉ – 1 Proc j, 0 ≤ j < p, do
  Copy B[j] into B[j + 2^k]
endfor
5.2 Data Broadcasting
EREW PRAM algorithm for broadcasting by Processor i:
Processor i write the data value into B[0]
s := 1
while s < p
  Processor j, 0 ≤ j < min(s, p – s), do
    Copy B[j] into B[j + s]
  s := 2s
endwhile
Processor j, 0 ≤ j < p, read the data value in B[j]

Fig. 5.3 EREW PRAM data broadcasting without redundant copying.
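The Fig. 5.3 algorithm transcribes directly into Python (a sequential sketch; the inner loop over j stands in for the processors acting in parallel, and every read and write address in a step is distinct, so all accesses are exclusive):

```python
# Transcription of the EREW broadcast without redundant copying.

def erew_broadcast(value, p):
    """Broadcast one value to p processors using only exclusive accesses."""
    B = [None] * p
    B[0] = value                      # the broadcasting processor writes first
    s = 1
    while s < p:
        # Processors j, 0 <= j < min(s, p - s), copy "in parallel";
        # sources B[0..s-1] and targets B[s..] never overlap in a step.
        for j in range(min(s, p - s)):
            B[j + s] = B[j]
        s *= 2
    return B                          # each Processor j now reads B[j]

erew_broadcast(42, 6)   # [42, 42, 42, 42, 42, 42]
```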
All-to-All Broadcasting on EREW PRAM
• To perform all-to-all broadcasting, so that each processor broadcasts a value that it holds to each of the other p – 1 processors, we let Processor j write its value into B[j], rather than into B[0].
• Thus, in one memory access step, all of the values to be broadcast are written into the broadcast vector B.
• Each processor then reads the other p – 1 values in p – 1 memory accesses.
• To ensure that all reads are exclusive, Processor j begins reading the values starting with B[j + 1], wrapping around to B[0] after reading B[p – 1].
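The wrap-around read schedule above can be sketched as follows (an illustrative simulation, not from the slides; in step k, Processor j reads B[(j + k) mod p], so no two processors ever touch the same cell in the same step):

```python
# Sketch of all-to-all broadcasting with skewed, exclusive reads.

def all_to_all(values):
    p = len(values)
    B = list(values)                  # one write step: Processor j fills B[j]
    received = [[] for _ in range(p)]
    for k in range(1, p):             # p - 1 exclusive read steps
        for j in range(p):            # all processors act "in parallel"
            received[j].append(B[(j + k) % p])
    return received

all_to_all([10, 20, 30])
# Processor 0 receives [20, 30], Processor 1 receives [30, 10],
# Processor 2 receives [10, 20].
```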
All-to-All Broadcasting on EREW PRAM
EREW PRAM algorithm for all-to-all broadcasting:
Processor j, 0 ≤ j < p, write own data value into B[j]
for k = 1 to p – 1 Processor j, 0 ≤ j < p, do
  Read the data value in B[(j + k) mod p]
endfor

This O(p)-step algorithm is time-optimal.
All-to-All Broadcasting on EREW PRAM
Naive EREW PRAM sorting algorithm (using all-to-all broadcasting):
Processor j, 0 ≤ j < p, write 0 into R[j]
for k = 1 to p – 1 Processor j, 0 ≤ j < p, do
  l := (j + k) mod p
  if S[l] < S[j] or (S[l] = S[j] and l < j)
  then R[j] := R[j] + 1
  endif
endfor
Processor j, 0 ≤ j < p, write S[j] into S[R[j]]

This O(p)-step sorting algorithm is far from optimal; sorting is possible in O(log p) time.

Because each data element must be given a unique rank, ties are broken by using the processor ID. In other words, if Processors i and j (i < j) hold equal data values, the value in Processor i is deemed smaller for ranking purposes. This sorting algorithm is not optimal in that the O(p²) computational work involved in it is significantly greater than the O(p log p) work required for sorting p elements on a single processor.
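The ranking idea can be rendered in Python as follows (a sequential sketch of the naive algorithm; ties are broken by index, as in the text, so every element receives a unique rank):

```python
# Sketch of sorting by rank computation via all-to-all broadcasting.

def rank_sort(S):
    p = len(S)
    R = [0] * p
    for k in range(1, p):
        for j in range(p):                    # all processors "in parallel"
            l = (j + k) % p
            if S[l] < S[j] or (S[l] == S[j] and l < j):
                R[j] += 1                     # one more element ranks below S[j]
    out = [None] * p
    for j in range(p):
        out[R[j]] = S[j]                      # S[j] goes to position R[j]
    return out

rank_sort([3, 1, 3, 2])   # [1, 2, 3, 3]
```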
5.3 Semigroup or Fan-in Computation
• This computation is trivial for a CRCW PRAM of the “reduction” variety if the reduction operator happens to be ⊗.
• For example, computing the arithmetic sum (logical AND, logical XOR) of p values, one per processor, is trivial for the CRCW-S (CRCW-A, CRCW-X) PRAM; it can be done in a single cycle by each processor writing its corresponding value into a common location that will then hold the arithmetic sum (logical AND, logical XOR) of all of the values.
5.3 Semigroup or Fan-in Computation
EREW PRAM semigroup computation algorithm:
Proc j, 0 ≤ j < p, copy X[j] into S[j]
s := 1
while s < p
  Proc j, 0 ≤ j < p – s, do
    S[j + s] := S[j] ⊗ S[j + s]
  s := 2s
endwhile
Broadcast S[p – 1] to all processors

Fig. 5.4 Semigroup computation in EREW PRAM.
0 1 2 3 4 5 6 7 8 9
S 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7 8:8 9:9
0:0 0:1 1:2 2:3 3:4 4:5 5:6 6:7 7:8 8:9
0:0 0:1 0:2 0:3 1:4 2:5 3:6 4:7 5:8 6:9
0:0 0:1 0:2 0:3 0:4 0:5 0:6 0:7 1:8 2:9
0:0 0:1 0:2 0:3 0:4 0:5 0:6 0:7 0:8 0:9
This algorithm is optimal for PRAM, but its speedup of O(p / log p) is not
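The Fig. 5.4 algorithm can be transcribed as follows (a sketch with ⊗ taken to be addition; the snapshot of S simulates the simultaneous reads of one parallel step):

```python
# Transcription of the EREW semigroup computation via recursive doubling.

def semigroup_sum(X):
    p = len(X)
    S = list(X)                       # Proc j copies X[j] into S[j]
    s = 1
    while s < p:
        old = S[:]                    # simulate simultaneous reads of the step
        for j in range(p - s):        # Procs j, 0 <= j < p - s, "in parallel"
            S[j + s] = old[j] + old[j + s]
        s *= 2
    return S                          # S[p-1] would then be broadcast

semigroup_sum([1, 2, 3, 4])   # [1, 3, 6, 10]; S[3] = 10 is the full sum
```

Note that, as the trace above shows, S[j] ends up holding the prefix 0:j for every j, a fact exploited in Section 5.4.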
5.3 Semigroup or Fan-in Computation
• When each of the p processors is in charge of n/p elements, rather than just one element, the semigroup computation is performed by each processor:
1. First combining its n/p elements in n/p steps to get a single value.
2. Then, the algorithm just discussed is used, with the first step replaced by copying the result of the above into S[j].
• It is instructive to evaluate the speedup and efficiency of the above algorithm for an n-input semigroup computation using p processors (n = p and n > p) (page 97).
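The two-phase scheme can be sketched as follows (an illustrative simulation assuming, for simplicity, that p divides n and that ⊗ is addition; the plain `sum` in phase 2 stands in for the O(log p)-step parallel algorithm above):

```python
# Sketch of the semigroup computation for n inputs on p processors.

def semigroup_n_over_p(X, p):
    n = len(X)
    block = n // p                       # assumption: p divides n
    # Phase 1: each processor reduces its own n/p-element block
    # sequentially; all p blocks are processed "in parallel".
    partial = [sum(X[j * block:(j + 1) * block]) for j in range(p)]
    # Phase 2: combine the p partial results (the Fig. 5.4 algorithm,
    # shown here as a plain reduction for brevity).
    return sum(partial)

semigroup_n_over_p(list(range(8)), 4)   # 28
```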
5.4 Parallel Prefix Computation
• We see in Fig. 5.4 that as we find the semigroup computation result in S[p – 1], all partial prefixes are also obtained in the previous elements of S. Figure 5.6 is identical to Fig. 5.4, except that it includes shading to show that the number of correct prefix results doubles in each step.
Fig. 5.6 Parallel prefix computation in EREW PRAM via recursive doubling.
0 1 2 3 4 5 6 7 8 9
S 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7 8:8 9:9
0:0 0:1 1:2 2:3 3:4 4:5 5:6 6:7 7:8 8:9
0:0 0:1 0:2 0:3 1:4 2:5 3:6 4:7 5:8 6:9
0:0 0:1 0:2 0:3 0:4 0:5 0:6 0:7 1:8 2:9
0:0 0:1 0:2 0:3 0:4 0:5 0:6 0:7 0:8 0:9
Same as the first part of semigroup computation (no final broadcasting)
5.4 Parallel Prefix Computation
• The previous algorithm is quite efficient, but there are other ways of performing parallel prefix computation on the PRAM. In particular, the divide-and-conquer paradigm leads to two other solutions to this problem.
• In the following, we deal only with the case of a p-input problem, where p (the number of inputs or processors) is a power of 2.
• As in Section 5.3, the pair of integers u:v represents the combination (e.g., sum) of all input values from xu to xv.
• Figure 5.7 depicts our first divide-and-conquer algorithm.
• We view the problem as composed of two subproblems: computing the odd-indexed results s1, s3, s5, . . . and computing the even-indexed results s0, s2, s4, . . . .
First Divide-and-Conquer Parallel-Prefix Algorithm
Fig. 5.7 Parallel prefix computation using a divide-and-conquer scheme. (Each vertical line represents a location in shared memory. Adjacent inputs x0, x1, x2, x3, . . . , xn–2, xn–1 are combined pairwise, a parallel prefix computation of size n/2 yields the odd-indexed results 0:1, 0:3, . . . , 0:n–1, and one final step produces the even-indexed results 0:0, 0:2, . . . , 0:n–2.)

In hardware, this is the basis for the Brent–Kung carry-lookahead adder.

T(p) = T(p/2) + 2  ⇒  T(p) ≅ 2 log₂ p
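Under the assumptions that ⊗ is addition and the input size is a power of 2 (as in the text), the Fig. 5.7 scheme can be sketched recursively:

```python
# Sketch of the first divide-and-conquer parallel prefix scheme:
# pair adjacent inputs (one step), recurse on the n/2 pair-sums to get
# the odd-indexed prefixes, then one more step gives the even-indexed
# ones -- hence the recurrence T(p) = T(p/2) + 2.

def prefix_dc1(x):
    n = len(x)
    if n == 1:
        return list(x)
    pairs = [x[2*i] + x[2*i + 1] for i in range(n // 2)]   # one step
    odd = prefix_dc1(pairs)           # odd-indexed results s1, s3, s5, ...
    s = [0] * n
    for i in range(n // 2):
        s[2*i + 1] = odd[i]
    s[0] = x[0]
    for i in range(1, n // 2):        # one more step: s_{2i} = s_{2i-1} + x_{2i}
        s[2*i] = s[2*i - 1] + x[2*i]
    return s

prefix_dc1([1, 2, 3, 4])   # [1, 3, 6, 10]
```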
Second Divide-and-Conquer Parallel-Prefix Algorithm
• Figure 5.8 depicts a second divide-and-conquer algorithm.
• We view the input list as composed of two sublists: the even-indexed inputs x0, x2, x4, . . . and the odd-indexed inputs x1, x3, x5, . . . . Parallel prefix computation is performed separately on each sublist, leading to partial results as shown in Fig. 5.8 (a sequence of digits indicates the combination of elements with those indices).
• The final results are obtained by pairwise combination of adjacent partial results in a single PRAM step.
• The total computation time is given by the recurrence T(p) = T(p/2) + 1.
• Even though this latter algorithm is more efficient than the first divide-and-conquer scheme, it is applicable only if the operator ⊗ is commutative (why?).
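This second scheme can be sketched as follows (again assuming ⊗ is addition, which is commutative as the text requires, and a power-of-2 input size): prefix each sublist, then combine adjacent partial results in a single step.

```python
# Sketch of the second divide-and-conquer parallel prefix scheme:
# recurse on the even- and odd-indexed sublists, then one combining
# step -- hence T(p) = T(p/2) + 1. Commutativity is needed to reorder
# the even- and odd-indexed elements in each combined prefix.

def prefix_dc2(x):
    n = len(x)
    if n == 1:
        return list(x)
    even = prefix_dc2(x[0::2])        # prefixes of x0, x2, x4, ...
    odd = prefix_dc2(x[1::2])         # prefixes of x1, x3, x5, ...
    s = [0] * n
    s[0] = even[0]
    for k in range(n // 2):
        s[2*k + 1] = even[k] + odd[k]          # x0 + ... + x_{2k+1}
        if 2*k + 2 < n:
            s[2*k + 2] = even[k + 1] + odd[k]  # x0 + ... + x_{2k+2}
    return s

prefix_dc2([1, 2, 3, 4])   # [1, 3, 6, 10]
```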
Second Divide-and-Conquer Parallel-Prefix Algorithm
Fig. 5.8 Another divide-and-conquer scheme for parallel prefix computation. (Each vertical line represents a location in shared memory. Parallel prefix computation is performed separately on the n/2 even-indexed inputs and the n/2 odd-indexed inputs; adjacent partial results are then combined pairwise to yield 0:0, 0:1, . . . , 0:n–2, 0:n–1.)

Strictly optimal algorithm, but requires commutativity.

T(p) = T(p/2) + 1  ⇒  T(p) = log₂ p