Scheduling Techniques of Processor Scheduling in Cellular Automaton

Mohammad S. Laghari and Gulzar A. Khuwaja

Abstract—Many problems in computer simulation of systems in science and engineering present potential for parallel implementation through one of the three major paradigms of algorithmic parallelism, geometric parallelism and processor farming. Static process scheduling techniques have been used successfully to exploit geometric and algorithmic parallelism, while dynamic process scheduling is better suited to dealing with the independent processes inherent in the processor farming paradigm. This paper considers the application of parallel computers or multi-computers to a class of problems exhibiting the spatial data dependency characteristic of the geometric paradigm. However, by using the processor farming paradigm in conjunction with geometric decomposition, a dynamic scheduling technique is developed to suit the MIMD structure of the multi-computers. The specific problem chosen for the investigation of scheduling techniques is the computer simulation of Cellular Automaton models.

Keywords—Cellular Automaton, multi-computers, parallel paradigms, scheduling.

I. INTRODUCTION

Static and dynamic scheduling of processes are techniques that can be used to optimize performance in parallel computing systems. When dealing with such systems, an acceptable balance between communication and computation times is required to ensure efficient use of processing resources. When the time to perform the computation on a sub-problem is less than the time taken to receive the data or transmit the results, the communication bandwidth becomes a limit on performance. With dynamic scheduling, an appropriate program can redirect the flow of data at run time to keep the processors as busy as possible and help achieve optimum performance [1].

The problem chosen here for the investigation of scheduling techniques is the cellular automaton (C.A.). The C.A. approach has been used in many applications, such as image processing, self-learning machines, fluid dynamics and the modeling of parallel computers. Because of their small compute requirements, many C.A. algorithms implemented on a network of processors exhibit the imbalance discussed above.

Mohammad S. Laghari is with the Electrical Engineering Department, Faculty of Engineering, United Arab Emirates University, P.O. Box: 17555, Al Ain, U.A.E. (phone: 00971-50-6625492; fax: 00971-3-7623156; e-mail: [email protected]).

Gulzar A. Khuwaja is with the Department of Computer Engineering, College of Computer Sciences & Information Technology, King Faisal University, Al Ahsa 31982, Kingdom of Saudi Arabia (e-mail: [email protected]).

A cellular automaton simulation, with an artificially increased compute load per cell (in the form of a number of simulated multiplies), is considered for parallelization. Such a simulation is representative of a class of recursive algorithms with local spatial dependency and fine granularity that may be encountered in biological applications, finite elements, and certain problems in image analysis and computational geometry [2]-[5]. These types of applications exhibit geometric parallelism and may be considered best suited to static scheduling. However, using dynamic scheduling, the MIMD structure of multicomputer networks is exploited, and a comparison of both schemes is given in the form of total timings and speedup.

II. THE C.A. MODEL

Cellular automata were introduced in the late forties by John von Neumann, following a suggestion of Stan Ulam, to provide a more realistic model for the behavior of complex, extended systems [6].

In its simplest form, a cellular automaton consists of a lattice or line of sites known as cells, each with value 0 or 1. These values are updated in a sequence of discrete time steps according to a definite, fixed rule. The overall properties of a cellular automaton are usually not readily evident from its basic rule, but given these rules, its behavior can always be determined by explicit simulation on a digital computer.

Cellular automata are mathematical idealizations of physical systems in which space and time are discrete, and physical quantities take on a finite set of discrete values. The C.A. model used in this investigation is a 1-dimensional cellular automaton in which processing takes place in a near homogeneous system having a fine level of granularity. It is conceptually simple and has a high degree of parallelism. It consists of a line of cells or sites x_i (where i = 1, ..., n) with periodic boundary conditions x_{n+1} = x_1, which means that the last cell in the line of sites is connected to the first cell. Each cell can store a single value or variable known as its state. At regular intervals in time, the values of the cells are simultaneously (synchronously) updated according to a local transition rule whose result depends on the previous state of the cell and those of its neighbors. The neighborhood of a given site is simply the site itself and the sites immediately adjacent to it on the left and right. Each cell may exist in one of two states, x_i = 0 or 1.

The local rules of C.A. can be described by an eight-digit binary number as shown in the following example. Fig. 1 specifies one particular set of rules for an elementary C.A.


International Conference on Intelligent Computational Systems (ICICS'2012) Jan. 7-8, 2012, Dubai


111  110  101  100  011  010  001  000
 1    0    0    1    0    1    1    0

Fig. 1 The 8 possible states of 3 adjacent sites

The top row gives all 2^3 = 8 possible values of the three sites in the neighborhood, and below each one is given the value taken by the middle site on the next time step according to a particular local rule. As any eight-digit binary number specifies a cellular automaton, there are 2^8 = 256 possible distinct C.A. rules in one dimension with a 3-site neighborhood. The rule in the lower line of Fig. 1 is rule number 150 (10010110), which has been used for the implementation of the C.A. algorithms in this paper.
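As a concrete illustration (a minimal Python sketch, not part of the paper's implementation), one synchronous update of a periodic line of sites under rule 150 can be written as:

```python
def step_rule150(cells):
    """One synchronous update of a 1-D C.A. under rule 150.

    Each new state is the XOR (sum modulo 2) of the site and its two
    neighbours; periodic boundaries connect the last cell to the first.
    """
    n = len(cells)
    return [cells[(i - 1) % n] ^ cells[i] ^ cells[(i + 1) % n]
            for i in range(n)]

# A single live site spreads to its neighbours in one step:
print(step_rule150([0, 0, 1, 0, 0]))  # -> [0, 1, 1, 1, 0]
```

Because every cell's new value depends only on the previous states of its three-site neighborhood, all cells can in principle be updated in parallel, which is what the schemes below exploit.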

The rules may be considered as a Boolean function of the sites within the neighborhood. Let x_i(t) be the value of site i at time step t. For the above example, the value of a particular site is simply the sum modulo two of its own value and the values of its two neighboring sites on the previous time step. The Boolean equivalent of this rule is given by:

x_i(t+1) = REM((x_{i-1}(t) + x_i(t) + x_{i+1}(t)), 2)

where REM is the remainder function.

This can be written in the form:

x_i(t+1) = x_{i-1}(t) ⊕ x_i(t) ⊕ x_{i+1}(t)

or schematically

x⁺ = x₋ ⊕ x ⊕ x₊

where ⊕ denotes addition modulo two or exclusive disjunction, x⁺ denotes the value of a particular site at the next time step, and x₋, x, x₊ denote the values of the left neighbor, the site itself, and the right neighbor on the previous time step, respectively.

The following shows how the above equations relate to rule

number 150 of the C.A. Suppose x₋ ≡ A, x ≡ B, x₊ ≡ C; then, using Boolean laws, the schematic equation becomes:

A ⊕ B ⊕ C
= (A.B̄ + Ā.B) ⊕ C
= A.B̄.C̄ + Ā.B.C̄ + Ā.B̄.C + A.B.C

Putting this equation into truth Table I shows the output giving rule number 150 in binary form when read from the most significant bit:

10010110₂ = 128 + 16 + 4 + 2 = 150
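This computation can be checked mechanically. The following Python sketch (an illustration, not from the paper) evaluates A ⊕ B ⊕ C over all eight neighborhoods, from 111 down to 000, and reads the output column off as a binary rule number:

```python
# Evaluate the rule output for every neighbourhood (A, B, C), taken in
# descending order 111 ... 000, then interpret the outputs as binary.
outputs = []
for k in range(7, -1, -1):
    a, b, c = (k >> 2) & 1, (k >> 1) & 1, k & 1
    outputs.append(a ^ b ^ c)          # rule: A XOR B XOR C

rule_number = int("".join(map(str, outputs)), 2)
print("".join(map(str, outputs)), "=", rule_number)  # 10010110 = 150
```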

Fig. 2 shows the evolution of a particular state of the C.A. through two time steps in the above example.

TABLE I
FINDING RULE NUMBER IN BINARY FORM

A B C | output
0 0 0 |   0
0 0 1 |   1
0 1 0 |   1
0 1 1 |   0
1 0 0 |   1
1 0 1 |   0
1 1 0 |   0
1 1 1 |   1

Fig. 2 Evolution of 1-D C.A. through two time steps

Fig. 3 shows the evolution of a 1-dimensional elementary cellular automaton according to the rule described above, starting from a state containing a single site with value 1. Sites with values 1 and 0 are represented with '*'s and ' 's, respectively. The configuration of the cellular automaton at successive time steps is shown on successive lines. The time evolution is shown for at most 20 time steps, or up to the point where the system is detected to cycle.

Fig. 3 Evolution of C.A. into a configuration up to 20 time steps
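An evolution of this kind can be reproduced with a short sketch (hypothetical Python, for illustration only; the function name and width are illustrative), printing '*' for 1 and ' ' for 0 and stopping after 20 steps or as soon as a configuration recurs:

```python
def evolve(width=31, max_steps=20):
    """Rule-150 evolution from a single central 1; returns the printed
    rows, stopping at max_steps or when a configuration recurs."""
    cells = [0] * width
    cells[width // 2] = 1
    rows, seen = [], set()
    for _ in range(max_steps):
        rows.append("".join("*" if c else " " for c in cells))
        key = tuple(cells)
        if key in seen:                 # cycle detected
            break
        seen.add(key)
        cells = [cells[(i - 1) % width] ^ cells[i] ^ cells[(i + 1) % width]
                 for i in range(width)]
    return rows

for row in evolve():
    print(row)
```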

III. PARALLEL PARADIGMS

In order to efficiently utilize the computational potential of a large number of processors in a parallel processing environment, it is necessary to identify the important parallel features of the application. There are several simple paradigms for exploiting parallelism in scientific and engineering applications, but the most commonly occurring types fall into three classes. These three paradigms are described in more detail in [7], [8].


A. Algorithmic Parallelism

Algorithmic parallelism is present where the algorithm can be broken down into a pipeline of processors. In this decomposition the data flows through the processing elements.

B. Geometric Parallelism

Geometric parallelism is present where the problem can be broken down into a number of similar processes in such a way as to preserve processor data locality, with each processor operating on a different subset of the total data to be processed.

C. Processor Farm

A processor farm is present where each processor executes the same program with different initial data, in isolation from all the other processors in the farm [9], [10].

IV. ALGORITHMS

In order to meet the high speed and performance requirements, a scalable and reconfigurable multi-computer system (NPLA) is used. This networked multi-computer system is similar to the NePA system used to implement Network-on-Chip [11].

The system used is a linear array of processors. It includes RISC processors and memory blocks. Each processor in the array has a compactOR, internal instruction memory, internal data memory, a data control unit, and registers. One of the processors is used as a master or main processor and the remaining as slaves. The system has a network interface, with the main processor having a four-port router and the others equipped with two-port routers. Routers can transfer both control and application data among processors. The two scheduling algorithms are described below.

A. Static Algorithm

In this implementation of the cellular automaton, the problem is statically implemented by using array processing. The algorithm is decomposed by using geometric parallelism. Ideally, the master processor should distribute a fixed number of cells uniformly across the ring of slave processors. At the start of an individual iteration, each cell process broadcasts the current state of its cell to its neighbors, in parallel with inputting the states of its neighbors from the neighboring cell processes. After this exchange of data, the cell updates its state using the rule described earlier. Instead of individual cell processes in each slave, each of which would communicate with the neighboring cells after every update, the master processor distributes fixed-size array segments of cells (for a total length of a maximum of 768 cells) uniformly across the worker array, with each processor being responsible for the defined spatial area.

Each iteration starts with the slave processors first exchanging boundary information with the neighboring processors, in such a way that the end elements of each array segment carry the information of the end elements of the neighboring segments. After this exchange, each array segment updates its results, with the help of the neighboring elements, for all the elements in parallel by using the cellular automaton rule described earlier. The updated results are assigned to another array. The results of all iterations are communicated back to the master processor.
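The per-iteration structure of the static scheme, boundary exchange followed by a segment-wise update, can be sketched as follows (a hypothetical serial Python simulation; in the paper each segment runs on a separate slave processor):

```python
def static_iteration(segments):
    """One iteration of the static scheme, simulated serially.

    segments is a list of equal-sized cell arrays forming one periodic
    line of sites.  Each segment first gathers the end elements of its
    neighbouring segments (the boundary exchange), then every element
    is updated with the rule-150 XOR of its three-site neighbourhood.
    """
    m = len(segments)
    updated = []
    for s in range(m):
        left = segments[(s - 1) % m][-1]    # boundary from left neighbour
        right = segments[(s + 1) % m][0]    # boundary from right neighbour
        ext = [left] + segments[s] + [right]
        # update all elements of the segment using the extended array
        updated.append([ext[i - 1] ^ ext[i] ^ ext[i + 1]
                        for i in range(1, len(ext) - 1)])
    return updated

line = [[0, 0, 0, 1], [0, 0, 0, 0]]         # 8 cells in 2 segments
print(static_iteration(line))               # -> [[0, 0, 1, 1], [1, 0, 0, 0]]
```

Note that only the two end elements of each segment cross a processor boundary per iteration, which is why performance hinges on the ratio of per-cell compute load to this fixed communication cost.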

Simulation tests are carried out for 20 iterations or time steps using from 1 to 7 slave processors, supplied with fixed-size array segments for a total array length of 768 cells. Artificially increased compute loads in the form of multiplies per cell (in steps of 20 multiplies) are introduced. Five loads of 20, 40, 60, 80 and 100 multiplies are used, which reside in the worker process of each slave. Table II shows the total timings in seconds for a normal and a range of artificially increased compute loads.

TABLE II
TOTAL TIMING IN SECONDS FOR 20 ITERATIONS IN STATIC SCHEME

Results without the additional compute load show no improvement in performance when the algorithm is implemented on multiple processors; the communications take more time than the computation in each slave. Results with the compute load of 20 additional multiplies show that there is a reasonable improvement in timings. The comparison shows that with an increase in the compute load, the overall performance of the algorithm and the utilization of the processors proportionally improve.

B. Dynamic Algorithm

In the previous implementation, the allocation of processes to processors is defined at compile time. It is possible to have the program perform the process allocation as it runs. In this implementation of the cellular automaton, the distribution of processing loads is performed dynamically. The topology used is the same as in the previous examples, which is a master processor and up to 7 slaves, now operating as a farm of processors with the code replicated on each of them.

In this algorithm, the master processor distributes work packets to the farm of slave processors. This processor is also responsible for the geometric decomposition and the tracking of the work packets through the iteration sequence. It consists of two main processes, send and receive, which execute in parallel and share two large arrays, data send and data receive. At the start of the first iteration, the send process farms out fixed-size data packets from the send array (which contains the line of sites to be computed) to the slave processors. Each data packet includes: an array segment of cells, the address of the segment location in the send array, and information about the end elements of the neighboring segments.
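The contents of a work packet, as listed above, can be sketched as a simple record (hypothetical Python; the class and field names are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class WorkPacket:
    """One fixed-size unit of work farmed out by the send process."""
    segment: list        # array segment of cells to update
    address: int         # location of the segment in the send array
    left_end: int        # end element of the left neighbouring segment
    right_end: int       # end element of the right neighbouring segment

pkt = WorkPacket(segment=[0, 1, 1, 0], address=4, left_end=1, right_end=0)

# A slave's worker process can update the segment self-sufficiently,
# using only the boundary values carried inside the packet:
ext = [pkt.left_end] + pkt.segment + [pkt.right_end]
result = [ext[i - 1] ^ ext[i] ^ ext[i + 1] for i in range(1, len(ext) - 1)]
print(result)  # -> [0, 0, 0, 1]
```

Carrying the neighbouring end elements inside the packet is what makes each packet independent, so any free slave can process it; the address field lets the master place the result back at the right position.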


The slave processors operate two main processes, both running in parallel. One is a worker process, where the actual computation takes place; it runs at low priority with respect to the other, which is a work_packet_schedular, as shown in Fig. 4.

Fig. 4 Work packet schedular on slave processors

The work_packet_schedular on each slave consists of:

• a schedular process, which inputs data packets from the master and schedules tasks through buffers, either to the worker process or to the next processor in the chain of slaves, on a first-come-first-served basis. The buffers operate as request buffers: as soon as a buffer has served its task, more work is requested from the schedular process. If requests for work from the worker process and the next processor arrive at the same time, priority is given to the worker process.

• a data_passer process, which inputs resultant data through buffers, from either the worker process or the previous processor on a first-come-first-served basis, and forwards it to the next processor leading towards the master processor.

In order to keep the slave processors busy, the task schedular buffers an extra item of work, so that when the worker process completes the computation for an array segment it can start on its next segment at once, rather than having to wait for the master processor to send the next item of work.

The worker process inputs the array segment together with the information of the end bits of the neighboring segments and the address bits. It then updates the segment according to the C.A. rule described earlier, stores the result in another array, adds the address bits, and communicates it to the data_passer process.

The processed array segments, together with the address bits, are received by the other main process, receive, in the master processor and are placed in the data receive array at the appropriate positions. This completes the first iteration.

For subsequent iterations, array segments can only be sent for processing if the adjoining neighbors are present; this is because of the end element information of the neighboring segments. Therefore, as soon as the master processor receives 3 contiguous segments in the data receive array, it copies the middle segment to the data send array. When 3 contiguous segments have been copied to the data send array, the middle segment from this array is sent to the slaves for further processing.
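This dependency rule can be sketched as follows (hypothetical Python; the function and variable names are illustrative): a segment becomes eligible for the next iteration only once it and both of its periodic neighbours are back in the data receive array.

```python
def ready_segments(received, total):
    """Return indices of segments that may be sent for the next
    iteration: a segment is ready when it and both of its (periodic)
    neighbours are present in the data-receive array."""
    return [s for s in range(total)
            if {(s - 1) % total, s, (s + 1) % total} <= received]

# With segments 0, 1 and 2 received out of 6, only segment 1 has both
# of its neighbours available:
print(ready_segments({0, 1, 2}, total=6))  # -> [1]
```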

Experiments are performed on the dynamically allocated scheme by varying the network sizes, the computational loads, and the size of the work packets in order to obtain optimum performance parameters. Timings from 1 to 7 slave processors are obtained for 20 iterations. Experiments are performed with varying packet sizes of 12, 24, and 48 cells for the total array length of 768 cells. Additional compute loads in the form of 20, 40, 60, 80 and 100 multiplies are used.

Table III shows the computation timings in seconds for the packet size of 24 cells in the dynamic scheme. The results of dynamic allocation show reasonable improvements in timings for the three packet sizes, the exception being the compute load of 20 multiplies, which shows only small improvements in performance for smaller networks.

TABLE III
TIMING FOR 20 ITERATIONS IN DYNAMIC SCHEME FOR 24 CELLS

The speedup for the packet size of 24 cells shows very good results for all the additional compute loads except for the case of 20 multiplies, as shown in Fig. 5. A near-linear speedup is obtained when four slave processors are used. For the load of 60 multiplies, a speedup of 5.76 is achieved when all the slaves are used. The results for the three segment sizes of 12, 24 and 48 cells are compared with artificially increased compute loads in terms of speedup. For comparison, compute loads of 20 and 100 multiplies are chosen.

Fig. 6 shows the speedup for the case of 20 multiplies. The array size of 12 cells shows no improvement in the results. The reason is that, for the case of 12 cells, the master processor distributes 64 array segments for each line of sites of 768 cells. Therefore, the master communicates a total of 1280 array segments to complete 20 iterations. With the compute load of 20 multiplies for each cell, the system does not balance the computation and communication loads. The results show that


the system takes much more time to communicate data packets of this size to and from the slave processors, and thus shows poor performance. Increasing the size of the data packets for the additional load of 20 multiplies has only a small effect on the performance. The array size of 48 cells shows slight improvements for up to 3 slave processors.

Fig. 5 Speedup for 24 cells in dynamic scheme

Fig. 6 Comparison of speedup results for the load of 20 multiplies

Fig. 7 shows the speedup for the case of 100 multiplies. Excellent results are obtained for all the array segments when from 1 to 4 slave processors are used. Again, the array size of 24 cells gives the best performance results when using all the available slave processors. Therefore, when comparing the results for all the additional compute loads, the array segment of size 24 with the compute load of 100 multiplies gives the best performance parameters in the dynamic scheduling scheme.

Fig. 7 Comparison of speedup for the load of 100 multiplies

Fig. 8 shows the timing comparison between the two schemes for seven processors. Except for the compute load of 20 multiplies, the dynamic scheme performs better for all other loads.

Fig. 8 Comparison of timings between the two schemes

V. CONCLUSION

In this paper we have considered a modified C.A. model with an artificially increased load. The recursive structure and spatial data dependency of this algorithm are representative of an important class of algorithms in science and engineering. The paper investigates the performance of scheduling techniques for the implementation of this type of algorithm on multicomputer networks. Experiments performed on the implementation of the above techniques suggest that over certain ranges of compute load, dynamic scheduling can outperform its static rival in terms of speedup.

REFERENCES

[1] T. L. Casavant and J. G. Kuhl, "A Taxonomy of Scheduling in General-Purpose Distributed Computing Systems," IEEE Trans. on Software Engineering, vol. 14, no. 2, Feb. 1988.

[2] M. V. Avolio, A. Errara, V. Lupiano, P. Mazzanti, and S. D. Gregorio, “Development and Calibration of a Preliminary Cellular Automata Model for Snow Avalanches,” in Proc. 9th Int. Conf. on Cellular Automata for Research and Industry, Ascoli Piceno, Italy, 2010, pp. 83–94.

[3] D. Cacciagrano, F. Corradini, and E. Merelli, “Bone Remodelling: A Complex Automata-Based Model Running in BIO SHAPE,” in Proc. 9th Int. Conf. on Cellular Automata for Research and Industry, Ascoli Piceno, Italy, 2010, pp. 116–127.

[4] M. Ghaemi, O. Naderi, and Z. Zabihinpour, “A Novel Method for Simulating Cancer Growth,” in Proc. 9th Int. Conf. on Cellular Automata for Research and Industry, Ascoli Piceno, Italy, 2010, pp. 142-148.

[5] Y. Zhao, S. A. Billings, and A. F. Routh, "Identification of Excitable Media Using Cellular Automata Models," Int. J. of Bifurcation and Chaos, vol. 17, pp. 153-168, 2007.

[6] A. Ilachinski, Cellular Automata: A Discrete Universe. Singapore: World Scientific Publishing, 2001.

[7] D. J. Pritchard, “Transputer Applications on Supernode,” in Proc. Int. Conf. on Application of Transputers, Liverpool, U.K., Aug. 1989.

[8] M. S. Laghari and F. Deravi, “Scheduling Techniques for the Parallel Implementation of the Hough Transform,” in Proc. Engineering System Design and Analysis, Istanbul, Turkey, 1992, pp. 285-290.

[9] A. S. Wagner, H. V. Sreekantaswamy, and S. T. Chanson, “Performance Models for the Processor Farm Paradigm,” IEEE Trans. on Parallel and Distributed Systems, vol. 8, no. 5, pp. 475-489, May 1997.

[10] A. Walsch, "Architecture and Prototype of a Real-Time Processor Farm Running at 1 MHz," Ph.D. thesis, University of Mannheim, Mannheim, Germany, 2002.

[11] Y. S. Yang, J. H. Bahn, S. E. Lee, and N. Bagherzadeh, "Parallel and Pipeline Processing for Block Cipher Algorithms on a Network-on-Chip," in Proc. 6th Int. Conf. on Information Technology: New Generations, Las Vegas, Nevada, Apr. 2009, pp. 849-854.
