Order statistics on a linear array with a reconfigurable bus


Transcript of Order statistics on a linear array with a reconfigurable bus

Page 1: Order statistics on a linear array with a reconfigurable bus

ELSEVIER Future Generation Computer Systems 11 (1995) 321-327

Order statistics on a linear array with a reconfigurable bus

Yi Pan *

Department of Computer Science, University of Dayton, Dayton, OH 45469-2160, USA

Abstract

The order statistics problem is considered in this paper. We present a parallel algorithm to find the kth smallest (or largest) element in a set of N totally ordered (but not sorted) data items. This algorithm runs in O(log² N) expected time on a reconfigurable linear array with N processors and a constant amount of memory per processor. We also show that this algorithm can be generalized to process an oversized problem efficiently.

Keywords: Complexity; Linear array; Parallel algorithm; Reconfigurable bus; Selection

1. Introduction

The problem of selection has a number of applications in computer science, computational geometry, and statistics. In statistics, selection is referred to as the computation of order statistics. In particular, computing the median element of a set of data is a standard procedure in statistical analysis. Selection also has applications in image analysis; selecting the peak values after the Hough transform is one of them. In a database context, selection amounts to answering a query on a collection X of records. Many algorithms, such as parallel merging, sorting, and convex hull computation, use selection as a procedure [4]. Selection can be stated formally as follows. Given a list X of N elements whose elements are in random order and an integer k satisfying 1 ≤ k ≤ N,

* Email: [email protected]

find the kth smallest element in X. Many parallel selection algorithms have been designed on different models to speed up its computation. Parallel selection algorithms on shared memory models were discussed in [3,8]. A number of algorithms exist for selection on a tree-connected computer [1,18]. The selection problem has also been tackled on variants of basic mesh-connected models. An algorithm has been proposed in [17] that runs on a mesh-connected computer with a broadcast capability. Chen et al. [7] showed how to compute the median on a mesh with multiple broadcasting. A similar result is described in [6]. An improved selection algorithm on a mesh with multiple broadcasting is presented in [5]. A selection algorithm on a two-dimensional (2-D) reconfigurable mesh has been proposed by ElGindy and Wegrowicz [9]. Olariu et al. gave a simpler selection algorithm on the same model in [14]. In this paper, a new and efficient parallel algorithm for solving the selection problem is

0167-739X/95/$09.50 © 1995 Elsevier Science B.V. All rights reserved. SSDI 0167-739X(94)00066-2


proposed for a linear array with a reconfigurable bus. The algorithm runs in O(log² N) expected time and uses a constant amount of memory space in each processor. To the best of our knowledge, our selection algorithm is the first one on a linear array.

2. The linear array with a reconfigurable bus

It is well known that interprocessor communications and simultaneous memory accesses often act as bottlenecks in parallel computers. In a processor array connected by point-to-point links, the computation time is lower-bounded by the diameter of the array. Thus, in a linear array of size N with neighboring processors connected by direct links, the time to solve the selection problem is at least O(N), the same order of magnitude as the best sequential selection algorithm [2]. To circumvent this problem, reconfigurable bus systems have recently been added to a number of parallel computers [10,12,15]. It has been pointed out [12,13] that the regular structure of arrays with a reconfigurable bus makes them eminently suitable for VLSI implementation. In fact, it has been argued [13] that arrays with a reconfigurable bus can be used as a universal chip capable of simulating any equivalent-area architecture without loss of time.

In this paper, the computation model considered is a linear array with a reconfigurable bus. In such a model, each processor is similar to a stand-alone RAM with its own local memory and can perform basic arithmetic and logical operations. The array operates in Single Instruction Multiple Data (SIMD) mode. The ID of each processor is an integer i, where 0 ≤ i < N. The ID of the first processor is 0 and that of the last one is N − 1. The two switches associated with a processor are labeled E (east) and W (west), respectively. Two processors can simultaneously set (connect) or unset (disconnect) a particular switch as long as the settings do not conflict. By adjusting the local connections within each processor, several disjoint subbuses can be established. Two subbuses are said to be disjoint if they have no common ports. Only one processor

(a) A linear array with four subbuses

(b) A linear array with two subbuses

(c) A linear array with a global bus

Legend: processor; connected switch; disconnected switch.

Fig. 1. A linear array with a reconfigurable bus.


can put its data item onto a given subbus at any time. In unit time, data put on a subbus can be read by every processor connected to it. A global bus is established when all the subbuses are connected together. Subbuses can be dynamically established to suit computational needs. Fig. 1 shows a linear array with several bus configurations.

Many researchers have reservations about the assumption that the time to transmit a signal along a bus is a constant, regardless of the number of switches through which the signal propagates. Although the delay incurred in traversing a switch is non-zero, insight from recent VLSI implementations has demonstrated that, even with today's technology, the delay is quite small. For example, the delay noticed on a YUPPIE chip is about 1/16 ns to 1 ns [11]. This delay is even shorter on a recent chip called GCN, which adopts precharged circuits [16]. These experiments confirm the feasibility and potential benefits of reconfigurable bus systems. We believe that the above assumption is reasonable when a linear array contains a few thousand processors.

3. The selection algorithm

Now, let us describe our selection algorithm. The selection algorithm proceeds along lines similar to those in [2]. The divide-and-conquer strategy is applied to solve the selection problem efficiently. Every iteration of the algorithm involves a current set C of candidates, that is, elements of the original input set A that have a chance of being selected as the kth smallest element of A. In every iteration, we partition the set C into three disjoint subsets

C1 = {c ∈ C | c < m},
C2 = {c ∈ C | c = m},
C3 = {c ∈ C | c > m},

where m is a distinguished element of C. In case |C1| ≥ k, we eliminate C2 and C3 and proceed recursively to solve the problem of selecting the kth smallest element in C1; in case |C1| < k and |C1 ∪ C2| ≥ k, the desired element is m; finally, if |C1 ∪ C2| < k, then we proceed recursively to select the (k − |C1 ∪ C2|)th smallest element in C3. In this manner we can replace the given problem by a smaller problem. This process is continued until the second condition above is satisfied and we find the kth smallest element in the set A.

The whole algorithm is spelled out in algorithm SELECT. In the algorithm, a current set of candidates is maintained through a local binary number IN. If the local IN of a processor is 1, the local data item is in the current set and the processor is active and will participate in future selection processes; otherwise, processors are passive in the sense that their local data items will not be considered further.

Consider a linear SIMD array with a reconfigurable bus of size N. The input is assumed to be a collection A of N integer numbers. We wish to select the kth smallest element in A. We can assume the numbers are unique without loss of generality since, if we are given arbitrary numbers x0, x1, ..., x(N−1), we can replace xi by (xi, i) and define an order on the tuples by (xi, i) < (xj, j) if xi < xj or if xi = xj and i < j. Clearly, the time complexity of the algorithm will remain the same after the replacement and the definition of the new order for these integer numbers.

Algorithm SELECT(D, k)
Input: A data vector D of length N and an integer k are distributed in a linear array with a reconfigurable bus of size N; i.e. each processor contains a data item of the vector D and the integer k. Also assume that initially all processors participate in the selection process; i.e. IN(i) = 1 for 0 ≤ i ≤ N − 1.
Output: the kth smallest element of the data vector D is found and stored in memory cell m of all processors.

(1) In this step, we want to select a distinguished number so that we can divide the current set into three subsets. Clearly, the distinguished number must be a data item in an active processor. This is done as follows. The linear array with a reconfigurable bus is connected into a single bus. Assume that processor 0 is the first processor and processor N − 1 is the


last processor on the single bus. All processors close their switches to their predecessors so that they can receive a signal from their predecessors. Then, a processor disconnects its switch to its successor if its local IN = 1; i.e. active processors do not send a '1' signal to their successors. If IN(0) = 1, then processor 0 is the first alive processor on the bus and the distinguished number m is the data item in processor 0. Otherwise, processor 0 sends a value '1' to its successor. The processor which receives a '1' is the first processor, on the single bus, which is still alive. Thus, its data item can be used as the distinguished number m.

(2) The PE which was selected in Step 1 broadcasts its data item D, and all processors put the received data into their local memory cell m.

(3) In this step, all active processors compare the received value m with the local data item D. If m > D, set B to 0, indicating that the local data item is in C1; otherwise, set B to 1, meaning that the local data item is in C2 or C3.

(4) Perform binary summations over B and IN across the whole array, and put the two sums in s and t, respectively. Clearly, s = |C2| + |C3| is the number of elements in C2 or C3; in other words, s is the number of data elements, in the current set of candidates, which are larger than or equal to the distinguished number m, and t is the size of the current set of candidates.

(5) Calculate |C1| = t − s, |C2| = 1, and |C3| = s − 1. Here, |Ci| is the size of the set Ci for i = 1, 2, 3.

(6) If |C1| ≥ k, then the kth smallest element is in C1. Hence, we set IN = 0 for all those processors whose B = 1 to eliminate C2 and C3, and also change their local B's to 0. In this way, these processors become passive and will not participate in future computation. Then, we recursively call SELECT(D, k). If |C1| < k and |C1| + |C2| ≥ k, then the kth smallest element is the distinguished number m; stop. If |C1| + |C2| < k, then the kth smallest element is in C3. Hence, we set IN = 0 for all those processors whose local B = 0 or whose data item is m, to eliminate the data elements in C1 and C2. Then, we recursively call SELECT(D, k − (|C1| + |C2|)).
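The six steps can be traced with a small sequential Python simulation (our own sketch; `select` and the list-based IN and B flags are illustrative stand-ins for the per-processor registers, and the steps that run in O(1) or O(log N) parallel time here become ordinary loops).

```python
def select(D, k):
    """Return the kth smallest (1-based) element of D, simulating SELECT.

    Assumes distinct items, as the paper does (ties can be broken by
    replacing x_i with the tuple (x_i, i)).
    """
    n = len(D)
    IN = [1] * n                      # 1 = processor still active
    while True:
        # Step 1: the first active processor holds the distinguished number.
        first = next(i for i in range(n) if IN[i] == 1)
        m = D[first]                  # Step 2: broadcast m to all processors
        # Step 3: active processors compare m with their local item.
        B = [1 if IN[i] and D[i] >= m else 0 for i in range(n)]
        # Step 4: binary summations over B and IN across the array.
        s, t = sum(B), sum(IN)        # s = |C2| + |C3|, t = |C|
        # Step 5: sizes of the three subsets.
        c1, c2 = t - s, 1
        # Step 6: keep only the subset that must contain the answer.
        if c1 >= k:                   # answer lies in C1: drop C2 and C3
            for i in range(n):
                if B[i]:
                    IN[i] = 0
        elif c1 + c2 >= k:            # the answer is m itself
            return m
        else:                         # answer lies in C3: drop C1 and C2
            k -= c1 + c2
            for i in range(n):
                if IN[i] and D[i] <= m:
                    IN[i] = 0

print(select([7, 3, 9, 1, 5, 8, 2, 6], 4))   # the 4th smallest item
```

Each pass through the loop is one iteration of SELECT; the expected number of such iterations is bounded in Section 4.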

4. Time analysis

In this section, we want to calculate the average time of the above algorithm. The time complexity of a parallel algorithm includes the time for data communication and the time for local computation. In this particular model, we assume that all bus communications and local computations take constant time. Before we can talk about the expected running time of an algorithm, we must agree on what the probability distribution of the inputs is. For selection, a natural assumption, and the one we shall make, is that every permutation of the set of numbers to be selected is equally likely to appear as an input.

Since the above algorithm runs recursively, we need to figure out the total time for each iteration first. In Step one, O(1) time is needed since this is basically a broadcast operation plus some local switch settings. Similarly, Steps two and three use O(1) time. Step four contains two binary summation operations across the whole array. We can perform a binary summation operation on the array as follows. In phase one, subbuses of length 2 are formed and additions are carried out for pairs of processors connected by the subbuses. The sum of each pair is stored in one of the two processors. In phase two, subbuses of length 4 are formed and the additions on the two partial sums obtained in phase one are carried out. In general, in phase i, subbuses of length 2^i are established and additions are performed on partial sums obtained in phase i − 1. Fig. 1 shows the bus connections of the above process for an array with eight processors. Since we need log N phases to complete the summation and each phase takes a constant time, the binary summation can be completed in O(log N) time. Thus, step four uses O(log N) time. Steps five and six involve local operations only, and so their times are all O(1). In summary, the total time used for each iteration is O(log N).
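The phase-doubling summation just described can be sketched as follows (our own illustration; `binary_sum` is a hypothetical name). In phase i, partners that are 2^(i−1) apart are added over subbuses of length 2^i, so the total is accumulated at processor 0 after about log2 N phases.

```python
def binary_sum(values):
    """Tree summation over subbuses of doubling length (O(log N) phases)."""
    vals = list(values)
    n = len(vals)
    stride = 1
    phases = 0
    while stride < n:
        # Subbuses of length 2*stride join partners that are `stride` apart;
        # the left partner of each pair accumulates the pair's sum.
        for left in range(0, n - stride, 2 * stride):
            vals[left] += vals[left + stride]
        stride *= 2
        phases += 1
    return vals[0], phases

total, phases = binary_sum([1] * 8)
print(total, phases)   # 8 ones summed in 3 phases
```

On the array, each phase is a constant-time bus operation, which is where the O(log N) cost of Step four comes from.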

Now, we calculate the average number of iterations for this algorithm. As we stated before, we assume the numbers in the set are unique without loss of generality. If they are not, we can replace xi by (xi, i) and define an order on the tuples by (xi, i) < (xj, j) if xi < xj or if xi = xj and i < j. In this way, all elements (tuples) are unique. Since the time complexity of the algorithm will remain the same after the replacement and the definition of the new order, in the following discussion we will assume that all numbers in the set are different from each other. Let R(N) be the expected number of iterations required by SELECT to select the kth smallest element in a set of N elements. Clearly, R(0) = R(1) = 1. In the best case, m is the kth smallest element and we do not need to continue calling SELECT recursively. On the other hand, in the worst case only one element is eliminated after each iteration. Hence, the total number of iterations in the worst case is N. In general, the selection problem of size N is reduced to a subproblem of size i, 0 ≤ i < N, after each iteration. Since i is equally likely to take on any value between 0 and N − 1, we have the following relationship:

R(N) ≤ (1/N) Σ_{i=0}^{N−1} R(i).   (1)

We shall show that for N ≥ 2, R(N) ≤ 4 log₂ N. For the basis N = 2, R(2) ≤ (1/2) Σ_{i=0}^{1} R(i) = 1 from (1), which is smaller than 4 log₂ 2. For the induction step, write (1) as

R(N) ≤ (R(0) + R(1))/N + (1/N) Σ_{i=2}^{N−1} R(i).   (2)

Since log₂ x is concave, it is easy to show that

Σ_{i=2}^{N−1} log₂ i ≤ ∫₂^N log₂ x dx ≤ N log₂ N − N − 2 log₂ 2 + 2.   (3)

Substituting (3) in (2), together with the induction hypothesis R(i) ≤ 4 log₂ i, yields

R(N) ≤ 2/N + (4/N)(N log₂ N − N − 2 log₂ 2 + 2) = 4 log₂ N − (4N + 8 log₂ 2 − 10)/N.   (4)

Since N ≥ 2, it follows that 4N + 8 log₂ 2 − 10 ≥ 0. Thus, R(N) ≤ 4 log₂ N follows from (4); i.e. the total average number of iterations of the selection algorithm for a set of size N is at most 4 log₂ N. The total expected time of the selection algorithm is the product of the average number of iterations and the time spent in each iteration. Therefore, we have the following theorem.

Theorem 1. The SELECT algorithm selects the kth smallest element in a data set of N elements on a linear array with a reconfigurable bus of size N in O(log² N) expected time.
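As a rough empirical check on the iteration bound behind Theorem 1 (our own experiment, not from the paper), one can simulate the partition process assuming, as the analysis does, that the rank of the distinguished element is uniform over the current candidate set:

```python
import random

def iterations(n, k, rng):
    """Rounds used by the partition scheme to find the kth smallest of n items,
    assuming the distinguished element's rank is uniform on the candidates."""
    size, rounds = n, 0
    while True:
        rounds += 1
        r = rng.randint(1, size)      # rank of m among the current candidates
        c1 = r - 1                    # |C1|: items smaller than m
        if c1 >= k:
            size = c1                 # recurse into C1
        elif c1 + 1 >= k:
            return rounds             # m itself is the answer
        else:
            k -= c1 + 1               # recurse into C3
            size -= c1 + 1

rng = random.Random(1)
n, trials = 1024, 2000
avg = sum(iterations(n, rng.randint(1, n), rng) for _ in range(trials)) / trials
print(avg, "vs the bound", 4 * 10)    # 4 log2(1024) = 40
```

For N = 1024 the observed average sits well below the 4 log₂ N = 40 bound, which is what the (loose) induction above guarantees.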

5. Oversized problems

We can generalize the above algorithm to solve an oversized selection problem on the array. In the following, we assume that the size of the array is P and the number of integers in the set is N. If P is larger than N, the above algorithm can be adapted easily. The only change is that we set IN = 0 initially for those PEs which do not have a data item in them. In this way, those PEs will not participate in the selection process even in the first iteration of the algorithm. Obviously, the time complexity of the algorithm remains the same.

If P is smaller than N, we assign N/P data items in the set to each processor of the array. Every processor now contains three vectors D, IN and B, each of size N/P, to store data and to indicate the status of its membership in the current set and in C1. The extended parallel algorithm differs from algorithm SELECT only in that each processor performs the binary summation of its two local vectors IN and B serially first; then the addition of the sums obtained locally is carried out across the whole array. During the removal of data items from further consideration, processors scan their local vectors serially. With these changes, steps one, three and six use O(N/P) time since each processor scans its local vectors of size N/P. The time used in steps two and five is still O(1). Step four uses O(N/P + log P) time since the local binary additions take O(N/P) time and the addition across the array of size P requires O(log P) time. Clearly, the total average number of iterations of the generalized selection algorithm for a set of size N remains O(log N). Using an analysis similar to that described above, we can obtain the following result.

Theorem 2. The generalized selection algorithm selects the kth smallest (or largest) element in a data set of N elements on a linear array with a reconfigurable bus of size P in O((N/P)log N + log P log N) expected time.
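Under the P < N mapping, Step 4 becomes a serial local sum followed by the array-wide tree sum; a sketch under those assumptions (`oversized_sum` is our own name, and the global phase reuses the doubling scheme from Section 4):

```python
def oversized_sum(local_vectors):
    """Step 4 with N/P items per processor: serial local sums, then a
    global tree sum across the P processors (O(N/P + log P) time)."""
    partial = [sum(v) for v in local_vectors]   # O(N/P) serial phase
    # Global phase: subbuses of doubling length, as in the P = N algorithm.
    stride = 1
    while stride < len(partial):
        for left in range(0, len(partial) - stride, 2 * stride):
            partial[left] += partial[left + stride]
        stride *= 2
    return partial[0]

# P = 4 processors, N/P = 3 flag bits each (N = 12):
print(oversized_sum([[1, 0, 1], [0, 0, 1], [1, 1, 1], [0, 1, 0]]))
```

The two phases are exactly the O(N/P) and O(log P) terms in Theorem 2's time bound.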

It is also very easy to adapt the above algorithm to find the set H of the k smallest (or largest) elements in the list X. Clearly, we can sort the elements in X and then pick up the bottom (or top) k elements; then the total time is equal to that of sorting. Here, we present a different approach. We first select the kth smallest (or largest) element using the above algorithm. Then we scan the elements of X and place an element of X in H if it is smaller (or larger) than, or equal to, the kth smallest (or largest) element. Since each processor contains N/P elements and each processor can scan its local elements independently, the whole scanning operation can be done in O(N/P) time. Thus, we have the following theorem.

Theorem 3. Finding the set H of the k smallest (or largest) elements in a data set of N elements on a linear array with a reconfigurable bus of size P can be done in O((N/P) log N + log P log N) expected time.
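The select-then-scan approach of Theorem 3 can be sketched sequentially (our own illustration; the `sorted` call here merely stands in for the parallel SELECT algorithm, and the scan is the part each processor performs independently on its N/P local items):

```python
def k_smallest(X, k):
    """Return the set H of the k smallest elements of X (distinct items)."""
    kth = sorted(X)[k - 1]        # stand-in for the parallel SELECT algorithm
    # Scan: keep every element not larger than the kth smallest.
    return [x for x in X if x <= kth]

print(sorted(k_smallest([9, 4, 7, 1, 8, 3], 3)))
```

With distinct items the filter returns exactly k elements; the kth element itself must be kept, which is why the comparison is "not larger than" rather than strictly smaller.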

6. Conclusions

Selection can also be accomplished by sorting the whole set and then picking up the desired element (indirect selection). Efficient sorting algorithms exist for shared-memory SIMD models. For example, sorting can be carried out in O((N/P) log N + log² N) time on a CREW (Concurrent-Read, Exclusive-Write) shared-memory SIMD computer and in O((N/P) log N + log³ N) time on an EREW (Exclusive-Read, Exclusive-Write) shared-memory SIMD computer [4]. Sorting on a linear array without a bus system can be done in O((N log N)/P + N) time [4]. It is not known whether we can sort N numbers on a linear array with a reconfigurable bus in less than O((N/P) log N + log P log N) expected time. Since sorting requires a large amount of data movement, it is unlikely that we can. Thus, performing selection directly seems to be more efficient than indirect selection.

Many selection algorithms on array processors have been proposed and their corresponding worst-case time complexities have been obtained. However, in the real world, the expected time complexity of an algorithm is a more important performance measure than the worst-case time complexity, and it is usually more difficult to derive. In this paper, we propose a new, simple selection algorithm for a linear array with a reconfigurable bus. Instead of deriving its worst-case time complexity, we concentrate on its expected time complexity. The result obtained in this paper is based on the assumption that every permutation of the set of numbers to be selected is equally likely to appear as an input. It remains open whether our algorithm achieves the theoretical lower bound.

Acknowledgements

We would like to thank two anonymous refer- ees for their valuable comments and suggestions.

References

[l] A. Aggarwal, A comparative study of X-tree, pyramid and related machines, Proc. 25th Annual IEEE Symp. on Foundations of Computer Science (Oct. 1984) 89-99.

[2] A. Aho, J. Hopcroft and J. Ullman, The Design and Analysis of Computer Algorithms (Addison-Wesley, 1974).


[3] S.G. Akl, An optimal algorithm for parallel selection, Information Processing Letters 19 (1) (July 1984) 47-50.

[4] S.G. Akl, The Design and Analysis of Parallel Algorithms (Prentice-Hall, 1989).

[5] D. Bhagavathi, P.J. Looges, S. Olariu, J.L. Schwing and J. Zhang, A fast selection algorithm for meshes with multiple broadcasting, Int. Conf. on Parallel Processing III (Aug. 17-21, 1992) 10-17.

[6] D. Bhagavathi, P.J. Looges, S. Olariu, J.L. Schwing and J. Zhang, Selection on rectangular meshes with multiple broadcasting, BIT 33 (1) (1993) 7-14.

[7] Y.C. Chen, W.T. Chen and G.H. Chen, Two-variable linear programming on mesh-connected computers with multiple broadcasting, Int. Conf. on Parallel Processing III (1990) 270-273.

[8] R. Cole and U. Vishkin, Deterministic coin tossing and accelerating cascades: Micro and macro techniques for designing parallel algorithms, Proc. 18th Annual ACM Symp. on Theory of Computing (May 1986) 206-219.

[9] H. ElGindy and P. Wegrowicz, Selection on the reconfigurable mesh, Int. Conf. on Parallel Processing III (Aug. 12-16, 1991) 26-33.

[10] H. Li and M. Maresca, Polymorphic-torus network, IEEE Trans. Comput. 38 (9) (1989) 1345-1351.

[11] H. Li and M. Maresca, Polymorphic-torus architecture for computer vision, IEEE Trans. Pattern Anal. Machine Intell. 11 (3) (Mar. 1989) 233-243.

[12] R. Miller, V.K. Prasanna-Kumar, D. Reisis and Q.F. Stout, Meshes with reconfigurable buses, MIT Conf. on Advanced Research in VLSI (1988) 163-178.

[13] R. Miller, V.K. Prasanna-Kumar, D. Reisis and Q.F. Stout, Data movement operations and applications on reconfigurable VLSI arrays, Proc. Int. Conf. on Parallel Processing (Aug. 1988) 205-208.

[14] S. Olariu, J.L. Schwing, L. Wilson and J. Zhang, A simple selection algorithm for reconfigurable meshes, Fifth ISMM Int. Conf. on Parallel and Distributed Computing and Systems, Pittsburgh, PA (Oct. 1-3, 1992) 257-261.

[15] J. Rothstein, Bus automata, brains, and mental models, IEEE Trans. Systems, Man and Cybernet. 18 (4) (1988) 522-531.

[16] D.B. Shu and J.G. Nash, The gated interconnection network for dynamic programming, in: Concurrent Computations, ed. S.K. Tewksbury et al. (Plenum, 1988) 645-658.

[17] Q.F. Stout, Mesh connected computers with broadcast- ing, IEEE Trans. Comput. C-32 (9) (Sep. 1983) 826-830.

[18] S.L. Tanimoto, Sorting, histogramming, and other statistical operations on a pyramid machine, in: Multiresolution Image Processing and Analysis, ed. A. Rosenfeld (Springer-Verlag, New York, 1984) 136-145.

Yi Pan was born in Jiangsu, China, on May 12, 1960. He entered Tsinghua University in March 1978 with the highest college entrance examination score of all 1977 high school graduates in Jiangsu. He received the B.Eng. degree in computer engineering from Tsinghua University, China, in 1982, and the Ph.D. degree in computer science from the University of Pittsburgh, USA, in 1991. Since 1991, he has been an assistant professor in the computer science department at the University of Dayton, Ohio, USA. His research interests include distributed computing, parallel algorithms and architectures, task scheduling, and image processing. He has published more than 28 papers in refereed international journals and conference proceedings related to his research. In 1990, he was awarded an Andrew Mellon Predoctoral Fellowship by the Mellon Foundation. In 1994, in recognition of his success, the International Biographical Center of Cambridge, England, included his biographical profile in the Sixteenth Edition of Men of Achievement.

Dr. Pan is a member of the IEEE Computer Society and the Association for Computing Machinery.