SLAC-161 UC-32
COMPUTING WITH MULTIPLE MICROPROCESSORS*
JOHN V. LEVY
STANFORD LINEAR ACCELERATOR CENTER
STANFORD UNIVERSITY
Stanford, California 94305
PREPARED FOR THE U. S. ATOMIC ENERGY
COMMISSION UNDER CONTRACT NO. AT(04-3)-515
April 1973
Printed in the United States of America. Available from National Technical Information Service, U. S. Department of Commerce, 5285 Port Royal Road, Springfield, Virginia 22151. Price: $3.00; Microfiche $0.95.
*Ph. D. Dissertation.
KEY WORDS AND PHRASES
asynchronous parallel processor
computer architecture
computer design
emulation
FORK-JOIN
instruction overlap
instruction set
machine language
microprogramming
multiple processor
multiprocessing
parallel processor
sorting
Computing Reviews categories: 4.32, 6.20, 4.21, 4.13
ABSTRACT
Computer systems with multiple processors are becoming more common
for reasons of reliability and modularity. The use of asynchronous processors,
however, leads to problems of complexity of control and of programming. This
work investigates the application of multiple asynchronous processors to the
computing task at the lowest level - that of interpreting single machine-
language instructions.
A particular computer configuration with 15 identical processors has been
constructed using an interpretive simulator. The processors are of relatively
low computing capacity. A common data bus connects the processors with each
other and with the main memory. A restriction on the logical connections
between processors allows each one to communicate with no more than two
others, in a chain-like arrangement.
Three examples - two sort instructions and a matrix multiply - were
coded for this machine and run using the simulator. By varying the bus cycle
time, we found that a common bus with cycle time equal to the processor
cycle time can adequately support up to 15 processors.
The amount of parallelism achieved was significant but showed dependence
on hardware parameters and on the algorithm implementations. Direct simu-
lation of the computer, with an execution trace of the running system, has
yielded some glimpses of how restriction of bus capacity can cause deteriora-
tion of the program execution efficiency and amount of parallelism.
A simple economic model of a multiple processor system is developed and
applied to the three examples. The result shows that the minimum cost per
throughput occurs with four, eleven, and fifteen processors, respectively, for
the three examples when the cost of a processor is one-tenth of the system cost.
ACKNOWLEDGMENTS
I acknowledge with gratitude the persistent guidance of William F. Miller,
Vice-President and Provost of Stanford University, who has also taught me
something about university administration.
The research for this dissertation has been supported over the years by
the able assistance of Harriet Canfield, Kathleen Maddern, Linda Lorenzetti,
Olga Stanners, and Carla West. Mrs. Canfield was particularly helpful at the
crucial time.
Discussions and arguments over many years with my colleague Victor
Lesser and more recently with Maurice Schlumberger have been instructive.
I appreciate the efforts of Professors Edward J. McCluskey, Jr., Edward S.
Davidson, and Forest Baskett in reading and commenting on this thesis and
fondly acknowledge the memory of George E. Forsythe, who should have been
among them and us all.
Finally, credit is due the untiring devotion of WYLBUR and the IBM
System/360 Model 91 at the Stanford Linear Accelerator Center, whose support
never slackened.
TABLE OF CONTENTS
                                                               Page
CHAPTER 1
Introduction and Annotated Bibliography ............ 1
  Introduction ...................... 1
  Programming Context .................. 3
  The Multiple-Processor Approach ............ 8
  Literature Survey and Annotated Bibliography ....... 12
CHAPTER 2
Computer Hardware Description ............... 34
  The Multiple Processor Model .............. 34
  Description of the Z-machine ............... 39
    The Processor ................... 40
    The Microprogram Store ............... 46
    The Main Store ................... 46
    The Bus and Allocator ................ 51
    The Allocator .................... 51
CHAPTER 3
Description of Example Programs .............. 54
  Introduction ...................... 54
  Expected Run Time Variation .............. 55
  Processor Allocation and Deallocation .......... 63
  Example 1: Sifting Sort ................ 65
  Example 2: Shell Sort ................. 67
  Example 3: Matrix Multiply ............... 71
CHAPTER 4
Measurements and Results .................. 76
  Measurements ..................... 76
  Results ........................ 80
    1. Bus and Memory Utilization ........... 80
    2. Bus Capacity .................. 81
    3. Main Memory Speed ............... 87
    4. Amount of Parallelism .............. 88
    5. Run Time versus Number of Processors ...... 96
    6. Cost-Performance Analysis ............ 96
CHAPTER 5
Unexpected Results and Possible Extensions .......... 106
  Results from Execution Trace .............. 106
  Processor-to-bus Buffering ............... 109
  Program "overhead" ................... 112
  Data Bus Design .................... 113
  Hardware Design Alternatives .............. 114
  Software Alternatives and Extensions ........... 114
APPENDIX A - The CISCO Simulation System ........... 117
APPENDIX B - Detailed Flowcharts, Data Values, and
             Assembler Listings ............... 118
APPENDIX C - List of Z-machine Operation Codes ........ 156
APPENDIX D - Result Data ................... 159
LIST OF ILLUSTRATIONS
                                                               Page
1-1   Four levels of computer operation .............. 4
1-2   Programmer's view of processor connections ........ 10
2-1   Z-machine configuration .................. 40
2-2   Z-machine bus bit assignments ............... 48
2-3   Z-machine bus transactions ................. 43
3-1a  Example 1: Variation of problem characteristics with
        number of processors .................. 56
3-1b  Example 1: Variation of problem characteristics with
        number of data elements ................. 57
3-2   Summary of problem characteristics ............. 62
3-3   Shell sort - simplified flowchart ............. 68
3-4   Shell sort - implementation strategy, N=8 ......... 70
3-5   Matrix multiply - simplified flowchart ........... 72
3-6   Stack contents for matrix multiply, N=2 .......... 74
3-7   Register contents for matrix multiply, N=2 ........ 75
4-1   Sample summary sheet .................... 77
4-2   Average utilization of bus and main memory ........ 80
4-3   Example 1: sifting sort - execution efficiency ...... 82
4-4   Example 2: Shell sort - execution efficiency ....... 83
4-5   Example 3: matrix multiply - execution efficiency ..... 83
4-6   Example 1: sifting sort - bus speed dependence ...... 84
4-7   Example 3: matrix multiply - bus and memory
        speed dependence .................... 85
4-8   Example 2: Shell sort - bus and memory speed
        dependence ....................... 86
4-9   Typical parallelism and execution time ........... 88
4-10  Example 1 - parallelism with fast bus and memory ..... 90
4-11  Example 1 - parallelism with slow bus and memory ..... 91
4-12  Example 2 - parallelism with fast bus and memory ..... 92
4-13  Example 2 - parallelism with slow bus and memory ..... 93
4-14  Example 3 - parallelism with medium bus and
        fast memory ...................... 94
4-15  Example 3 - parallelism with slow bus and memory ..... 95
4-16  Example 1 - run time curves ................ 97
4-17  Example 2 - run time curves ................ 98
4-18  Example 3 - run time curves ................ 99
4-19  Constant cost curves .................... 101
4-20  Example 1 - cost curves .................. 102
4-21  Example 3 - cost curves .................. 103
4-22  Example 2 - cost curves .................. 104
4-23  Number of processors for minimum cost/throughput ..... 105
5-1   Sample of execution trace from Example 1 ......... 107
5-2   Sample of execution trace from Example 3 ......... 108
5-3   Sample of execution trace from Example 2 ......... 110
5-4   Comparison of runs with and without bus
        buffer (Example 3) ................... 111
5-5   Microprogram overhead for multiple-processor
        coordination ...................... 112
B-1   Z-machine definitions for assembler ............ 118
B-2   Example 1 - assembler listing ............... 120
B-3   Example 2 - assembler listing ............... 125
B-4   Example 3 - assembler listing ............... 135
B-5   Example 3 - matrix data values ............... 146
B-6   Example 1 - detailed flowchart ............... 147
B-7   Example 2 - detailed flowchart ............... 150
B-8   Example 3 - detailed flowchart ............... 154
"One of my primary objects is to form the tools so the tools themselves shall fashion the work."
Eli Whitney
CHAPTER 1
INTRODUCTION AND ANNOTATED BIBLIOGRAPHY
Introduction
The development of computers has been dominated since its inception by
the concept of a processor as a powerful calculating engine surrounded by
supporting facilities - random-access memory, input and output transmission
media, and peripheral devices. Improvements in circuit speeds and arithmetic
capabilities have merely emphasized the ability of the computer designer to
create far more computational capacity in a central processor than can be
dispensed in any useful way. The latest hardware developments have con-
centrated on using processor-like resources to augment memory, input-output,
and peripheral facilities. Software development has brought about multipro-
gramming - computing alternately on several available tasks by dividing the
memory and input-output resources among them - which has helped to reclaim
a portion of the expensive but unused central processor capacity.
The thesis of this dissertation is that processing (computation) capacity
should be divided and dispensed on demand by using multiple asynchronous
processors of low capacity. We should no longer be intimidated by apparent
problems of control of large numbers of processors. Justification of this
position will be by construction of a plausible multiple-processor computer
system and by demonstrating the running of several example algorithms on it.
The construction has been made in a simulation context, and by varying
some of the design parameters, the limitations imposed by the organization
of this particular computer system have been measured.
In most computing systems built until recently, the central processor
has been of relatively high capacity. Therefore, multiple processor config-
urations have been considered reasonable only in "large systems" - ones
where the computational burden is expected to be high. In order to extend the
validity of the multiple-processor concept to systems of smaller capacity, a
small discrete unit has been chosen. The processor used in the studies here
has the approximate computing power of a typical (under $5,000 today)
“mini-computer”. The choice of such a small processor makes the designer
immediately face the following trade-off: as the unit size of a resource gets
smaller, the significance of the overhead of allocation usually becomes greater
because more units must be allocated for a typical task. The system shown
here has a low time-overhead of processor allocation. In order to guarantee
that this is achievable, certain compromises have been made in the generality
of inter-processor communication.
What kind of parallelism is possible with an asynchronous, multi-
instruction-stream organization? The example algorithms shown here are all
intended to be typical of routines which implement a single (higher-level)
operation. In this sense, the algorithms are emulation routines, micropro-
gram implementations of higher-level language constructs. Whether the
reader chooses to call these constructs "machine-language" operations will
depend on his viewpoint. In any case, the routines are "interpretive" in the
sense that each one performs a single conceptual operation. The fact that
parallelism is demonstrated here only within the interpretive routine (i.e., at
the “micro” level) is not intended to rule out the possibility of further paral-
lelism at higher levels, by use of the more conventional techniques of co-
routines, multiprocessing, and multiprogramming. However, the oper-
ations implemented by these example routines are to be regarded functionally
as primitive operations in which the parallelism is out of the control of the
problem-programmer (the user of the operations). This idea will be foreign
to the software-oriented reader who has not concerned himself with the paral-
lelism and control problems in, say, the arithmetic unit of a processor which
is executing a “multiply” instruction. The processors in this simulated system
are intended to be a resource at the command of the microprogrammer, who
is implementing “machine-language” instructions by writing emulation routines
for them.
The example operations demonstrated here - two sorting methods and a
matrix multiply - are on the other hand rather "high-level" operations
compared to typical machine language instructions. Although the author is a
strong advocate of more complex "machine-language" instructions, these
examples are not necessarily the top candidates for new general-purpose
machine language instructions. The hardware-design oriented reader is
encouraged to accept them as vehicles for demonstrating the modes of multiple
processor interaction possible in this system. The three main examples are
also intended to stress, respectively, the bounds on inter-processor communi-
cation, memory access, and computation resources.
Programming Context
Figure 1-1 shows four of the levels of operation in a computer system
when a job is to be run. A system analyst or user describes the task to be
performed with a flowchart, shown at level 1. A programmer transforms the
[Figure: four panels illustrating the levels - Level 1: problem description (flow chart); Level 2: high-level language (PL/I); Level 3: machine language (IBM 360 assembler); Level 4: microcode (360 Model 40).]
FIG. 1-1. Four levels of computer operation.
flowchart into a set of procedures written in a high-level language. The
sorting portion of this example is shown coded in PL/I. The high-level
language is compiled into machine-language and executed. Shown in the figure,
at level 3, is a portion of the IBM System/360 machine code generated by
compiling the PL/I program shown at level 2. In a processor implemented
using microprograms, some or all of the machine-language instructions in-
voke a sequence of micro steps. At level 4 in the figure is shown a small
portion of a microinstruction sequence invoked in an IBM microprocessor, in
carrying out a machine-language instruction (Husson [52], p. 290).
We pause to observe that "machine language" refers to a machine which
often exists only on paper and in the mind of the machine-language programmer.
We will use the word "architecture" to refer to the structure of this conceptual
machine, as in "IBM System/360 architecture". The term "virtual machine"
is also used to refer to such a conceptual machine, but we avoid using this
term in the machine-language context because it is also used for other concepts.
The underlying microprogrammed hardware which implements a given
machine language may have widely varying structure. We will refer to this
structure as machine "organization". The activity of executing a sequence of
microinstructions which implement a machine-language instruction will be
called “emulation”. Emulation is a particular case of interpretive execution.
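To make this sense of "emulation" concrete, the sketch below (in Python, purely for compactness; all opcode names and data structures here are invented for illustration and are not the Z-machine's actual instruction set, which is listed in Appendix C) shows machine-language instructions being carried out interpretively, each opcode dispatching to a microroutine:

    # Toy emulator: each "machine-language" opcode invokes a microroutine.
    def micro_load(state, operands):
        addr, dest = operands
        state["regs"][dest] = state["memory"][addr]

    def micro_add(state, operands):
        a, b, dest = operands
        state["regs"][dest] = state["regs"][a] + state["regs"][b]

    MICROROUTINES = {"LOAD": micro_load, "ADD": micro_add}

    def emulate(state, program):
        # Interpretive execution: one machine instruction at a time.
        for opcode, operands in program:
            MICROROUTINES[opcode](state, operands)

    state = {"regs": [0] * 4, "memory": [10, 32, 0, 0]}
    emulate(state, [("LOAD", (0, 0)), ("LOAD", (1, 1)), ("ADD", (0, 1, 2))])
    assert state["regs"][2] == 42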
The software support necessary to carry out the activity of Fig. 1-1 is
not shown. A compiler is necessary to transform the program at level 2 to
level 3. A loader and a linkage editor are necessary to put together the por-
tions of a program before execution. A sizable software environment is pre-
sumed when the program is running, to handle error conditions and file access.
At higher levels, executive control of job processing, and file and hardware
maintenance routines, are presumed to exist.
In the work described here, a specific multiple-processor computer
organization is proposed, and used to implement three different example
instructions which currently would be regarded as being at level 1 or level 2.
In particular, the SORT operation portrayed in Fig. 1-1 is Example 1, a
sifting or "bubble" sort algorithm. In this case, we have jumped over two of
the levels and have implemented directly in microcode an operation which has
a relatively large amount of processing to do and yet is a single conceptual
unit in the original problem statement. This type of level-jumping has been
discussed in previous work on "direct execution" of high-level languages, in
which the compiling step of the transformation from level 2 to level 3 is
removed; and in the extension to direct microcode implementations of high-
level constructs.
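For reference, the sequential meaning of the Example 1 SORT operation - a sifting ("bubble") sort - fits in a few lines of a high-level language; Python is used here for compactness. On the Z-machine this entire routine is one "machine-language" instruction realized as a microprogram (the multiple-processor implementation itself is described in Chapter 3 and Appendix B; the code below shows only the functional effect of the operation):

    # Sequential meaning of the SORT primitive: a sifting ("bubble") sort.
    def sift_sort(table):
        n = len(table)
        for i in range(n - 1):
            for j in range(n - 1 - i):
                if table[j] > table[j + 1]:          # out of order: exchange
                    table[j], table[j + 1] = table[j + 1], table[j]
        return table

    assert sift_sort([3, 1, 2]) == [1, 2, 3]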
The example operations shown in this work should be considered as part
of a large set of high-level primitive operations which would make up a
“machine language”. The algorithmic problem description, possibly at level 1,
would still require compiling into sequences of these primitive operations.
However, the more closely the primitive operations resemble the high-level
conceptual units, the easier it will be to generate compact representations of
these units in low-level code.
A stronger argument for implementing high-level constructs in microcode
is that execution time can be saved. This time saving is achieved at the
expense of some flexibility available with simpler processing primitives.
Another saving, however, is the removal or reduction in complexity of the
programming task at the next level up. The human activity of programming
is one of the largest components of cost in performing a computer-based task.
Any savings in the often-repeated job of high-level language coding can be
applied against the relatively infrequent, but now more sophisticated task of
microcode generation.
The increase in the difficulty of the microprogrammer's task, a con-
sequence of "elevating" the machine language, is not to be skipped over lightly.
The microprogrammer has responsibility for properly controlling independent
functional units of the hardware, as well as producing an algorithm which is
guaranteed to execute correctly for all possible timing and data conditions (or
at least: to execute correctly or detect errors in an orderly way for all
possible conditions). If it becomes impossible to find people who can write
such microcode for the high-level constructs chosen, the viability of the
machine organization proposed here - applied to such high-level operations -
is doubtful. However, we contend that a small number of microprogrammers,
well versed in both hardware control and algorithm design, can develop all the
primitive operations which may be required for special or general-purpose
computation. They will only require careful specification of the set of primitive
operations, and a proper set of criteria for success or cost measures.
The great inhibition to development of higher-level primitive operations,
other than the inertia of tradition, has been our inability to answer the question:
what are the primitive operations appropriate to the problem? Even when the
answer to this question is known, the answer is highly specific to the problem.
Therefore, a set of low-level operations with which one can “do anything” has
been the solution to both general-purpose and special-purpose computing.
Now that read-write microcode storage allows us to consider modifying the
machine language of a fixed hardware organization, and the market for
single-application (special-purpose) computers is large enough, we are ready
to consider designing specialized sets of instructions tailored to the application.
The instructions which are the examples in this work - two sort instruc-
tions and a matrix multiply - were chosen not so much for being prototypes
of an inevitable future instruction repertoire as for demonstrating the level of
complexity achievable in reasonable lengths of microcode, using plausible
underlying hardware. The choice of algorithms for implementing them was,
of course, influenced by the desire to elucidate the strengths and limitations
of this hardware organization.
The Multiple-Processor Approach
In large computing systems, the demand for reliability and for high peak
computing capacity has led to implementation of a number of multiple-processor
configurations. The cost of the relatively large processors in these systems
is typically justified by either the even larger investment in peripheral devices
or by a very high premium on peak capacity. The economics of computer
electronics is now making it possible to consider implementing relatively
small-capacity computer systems with multiple-processors. In particular,
if the unit of “processor resource” is implementable on a few integrated-
circuit chips, the cost per processor (if they are identical) may become a very
small percentage of the total (hardware) system cost. (This economic consid-
eration is explored further in Chapter 4.) Because of this, computer manu-
facturers and others are beginning to propose systems composed of multiple
"mini-computers" (Dorff [44], Mutzeneek [47], Stern [71]). The research
described here is intended to shed light on the questions which arise when a
computer system contains a number of identical processors, each with rela-
tively small computing capacity. Some of these questions are:
1. How can the computing power of the processors be combined to
speed up the execution of a given task?
2. By what mechanism are the processors to be allocated to a task,
and among competing tasks?
3. How are the processors to communicate with each other?
4. To what extent is the multiplicity of processors to be reflected
in the programming language seen by the “user”?
In this study, we have answered "not at all" to this last question, choosing
to isolate the parallelism within one programming level. Thus, a sort oper-
ation which may invoke more than one processor is seen by the user as a
primitive operation and a functional description of this operation would not
have to refer to its multiple-processor implementation method. The novelty
of the programming presented here is that the execution does not depend on
the availability of any specific number of processors.
In order to simplify the programming, a restriction on the communications
between processors has been imposed; this restriction also simplifies the
allocation of processors. Figure l-2 shows the functional connections which
may be made by an executing processor. Not shown are the allocator mecha-
nism, which creates the connection to a “slave” processor, or the micropro-
gram store, which provides an instruction stream to each executing processor.
A complete description of the Z-machine (as this model is called) is given in
Chapter 2.
By restricting the interconnections between processors to this linear
"unary tree", as it might be called, the programming of a task is constrained
-9-
t + PM
A MAIN
MEMORY v
4 + P
A
“MASTER”
“SLAVE”
2106AZZ
FIG. l-2. Programmer’s view of processor connections.
in several ways. First, the symmetry of the chain of processors creates an
incentive to write re-entrant code, causing each processor to behave like its
neighbor; this also, incidentally, makes debugging easier. Second, the
instruction set of each processor need contain only three different input source
and output destination references (four, if the allocator is included). This
allows relatively simple testing and state-keeping during execution; it also
simplifies the hardware design. Finally, a processor may execute a detach
instruction to remove itself from the chain it is in. In this way, completely
independent processing is possible. This mechanism was not, however, used in
any of the three examples here.
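The chain discipline can be sketched in executable form. In the Python sketch below, threads stand in for processors, and the routine names, message scheme, and summation task are invented for illustration (they are not taken from the Z-machine examples); each "processor" communicates only with its master and with the single slave it may FORK, and the same re-entrant routine runs in every processor:

    # Sketch of the chained-processor discipline: one re-entrant routine,
    # each processor talking only to its master and at most one slave.
    import threading, queue

    def processor(to_master, values, forks_left):
        if forks_left > 0 and len(values) > 1:    # FORK: extend the chain
            from_slave = queue.Queue()
            slave = threading.Thread(
                target=processor,
                args=(from_slave, values[1:], forks_left - 1))
            slave.start()
            partial = from_slave.get()            # JOIN: await slave's result
            slave.join()
        else:
            partial = sum(values[1:])             # no slave: finish locally
        to_master.put(values[0] + partial)        # report to the master

    result = queue.Queue()
    threading.Thread(target=processor, args=(result, [1, 2, 3, 4], 14)).start()
    print(result.get())   # 10; four processors form the chain here (15 allowed)

Note that the routine runs correctly whatever value of forks_left it is given, which is the sense in which execution does not depend on the availability of any specific number of processors.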
The central thrust of this work is to provide a demonstration of one
answer to question 1: how to combine the computing power of many small
processors. By restricting the hardware and software contexts to a particular
configuration with a specified set of instructions, we have been able to con-
struct an operating model of a plausible multiple-processor computing system.
This system (the “Z-machine”) has been implemented, with a maximum of
15 processors, using the CISCO simulator (see Appendix A). Three example
algorithms have been coded and run on the simulated Z-machine. By varying
the maximum number of processors in the system and gathering data on the
run time and many other measures, we have been able to answer the following
questions within the context of the Z-machine:
1. Can multiple-processor algorithms be coded in a reasonable length
of (microprogram) memory?
2. How does the run time of a problem change with increasing numbers
of available processors?
3. Is there a practical limit on the number of processors which can
be used? In what way does the limit depend on the nature of the
algorithm?
4. How do we determine the optimal number of processors, based
on their cost? (A toy version of this calculation is sketched below.)
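Question 4 admits a simple back-of-the-envelope treatment. The economic model actually used is developed in Chapter 4; the Python sketch below uses an invented run-time curve (not measured Z-machine data) only to show the shape of the calculation: system cost grows linearly with the number of processors, and we pick the processor count minimizing cost per throughput, i.e., cost times run time.

    # Toy cost-per-throughput minimization.  The run-time function is a
    # hypothetical stand-in; Chapter 4 uses measured run times instead.
    def cost(p, base=1.0, proc_cost=0.1):
        return base + proc_cost * p        # a processor costs 1/10 of base

    def run_time(p, serial_fraction=0.2):
        return serial_fraction + (1.0 - serial_fraction) / p

    def best_processor_count(max_p=15):
        return min(range(1, max_p + 1), key=lambda p: cost(p) * run_time(p))

    print(best_processor_count())          # 6 for these invented parameters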
The example algorithms and their implementations are described in
Chapter 3. The measurement parameters and the major results are pre-
sented in Chapter 4. In addition to the questions above, the measurements
have led to conclusions regarding the performance tradeoffs of the hardware
organization. The results in Chapter 4 give answers to the following questions:
1. Is a common data bus a viable means of communication among
processors and memory?
2. What bus capacity (bandwidth) is necessary to avoid serious
bottleneck phenomena? (A toy saturation model is sketched after this list.)
3. What happens to efficiency, run time, and parallelism as the
bus becomes a bottleneck?
4. What happens to efficiency, run time, and parallelism as main
memory speed varies?
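For question 2, a crude saturation argument already suggests what to expect. If each active processor wants the bus for some fraction of its cycles, the bus keeps up only until aggregate demand reaches one transaction per bus cycle; beyond that point, added processors stall. The demand figure in this Python sketch is invented, not a measured Z-machine value; the measured behavior is reported in Chapter 4.

    # Toy bus-saturation model: effective parallelism flattens once the
    # processors' aggregate bus demand reaches the bus's capacity.
    def effective_parallelism(p, bus_demand_per_proc=0.25):
        # bus_demand_per_proc: fraction of a processor's cycles that
        # need a bus transaction (invented value).
        return min(p, 1.0 / bus_demand_per_proc)

    for p in (1, 2, 4, 8, 15):
        print(p, effective_parallelism(p))   # flattens at 4 in this example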
Literature Survey and Annotated Bibliography
In this section we review the origins of some multiple-processor concepts
and attempt to identify those thoughts in prior published work which point
toward asynchronous low-level parallelism.
We begin by citing separately a body of work which begins with the well-
known paper by Aschenbrenner, Flynn, and Robinson [3] and ends - in this
survey only - with an article by Flynn and Podvin [14]. Professor Flynn
has done much to classify the various approaches to computer organization;
his design preferences have, however, served more as a stimulant than as a
model for our work.
We include in this group a series of papers on the IBM System/360
Model 91 which, with its asynchronous arithmetic units and common data bus
in the floating-point units, has also stimulated our interest in multiple pro-
cessor systems. The Model 91 also has a special significance; we used slightly
over 23 hours of CPU execution time on the Model 91 at the Stanford Linear
Accelerator Center in generating the simulation results presented here.
We also include here two papers by Chen [6, 7] in which he has defined
several of the concepts we have used, such as the "space-time" of a job,
which we call "area".
1. Anderson, S. F., Earle, J. G., Goldschmidt, R. E., Powers, D. M.,
The IBM System/360 Model 91: floating-point execution unit, IBM J.
R&D (Jan. 1967), 34-53.
Pipelined, independent execution units ". . . provide
concurrent instruction execution at the burst rate of one
instruction per cycle."
2. Anderson, D. W., Sparacio, F. J., Tomasulo, R. M., The IBM System/
360 Model 91: machine philosophy and instruction-handling, IBM J. R&D
(Jan. 1967), 8-24.
3. Aschenbrenner, R. A., Flynn, M. J., Robinson, G. A., Intrinsic
multiprocessing, SJCC (1967), 81-86.
Fairly detailed description of multiple-processor effect
by sharing arithmetic and other functional units among many
pseudo-processors (instruction-decode streams). Brief
description of executive functions. Very centralized execu-
tive strategy. Passing mention of intra-program parallelism.
4. Boland, L. J., Granito, G. D., Marcotte, A. U., Messina, B. U.,
Smith, J. W., The IBM System/360 Model 91: storage system,
IBM J. R&D (Jan. 1967), 54-68.
5. Chen, T. C., The overlap design of the IBM System/360 Model 92
Central Processing Unit, FJCC (1964), Part II, 73-80.
Early discussion of Model 91 (to be) organization.
6. Chen, T. C., Unconventional superspeed computer systems, SJCC
(1971), 365-371.
'Job profile', 'space-time' and 'parallelism ratio'
definitions; 'micro-multiprogramming': "the workload
characteristic to be exploited is job independence, rather
than the much more restricted job parallelism."
7. Chen, T. C., Parallelism, pipelining, and computer efficiency, Computer
Design 10, 1 (Jan. 1971), 69-74.
Includes 'micro-multiprogramming,' automatic smoothing
of low-level congestion, such as in a cache.
8. Cook, R. W. and Flynn, M. J., System design of a dynamic micro-
processor, IEEE Trans. C-19 (Mar. 1970), 213-222.
“DMP” machine design - an extremely flexible pro-
cessor with writable micro memory. Microinstruction
sequencing determined in each microinstruction; examples
of table search, division, square root; comparison with
System/360. Assembler language for microprograms
defined.
9. Flynn, M. J., Very high speed computing systems, Proc. IEEE 54, 12
(Dec. 1966), 1901-1909.
Classifies computer systems with respect to single/
multiple instruction and data stream, and discusses known
techniques for augmenting the performance of these systems
and their limitations. Defines "bandwidth", "latency",
"confluence". ". . . a restricted MIMD (multi-instruction
stream, multi-data stream) may be organizationally much
simpler than a confluent SISD".
10. Flynn, M. J., Organization of computing systems, Notes from Computer
and Program Organization, University of Michigan Engineering Summer
Conference, 1970.
11. Flynn, M. J., Shared internal resources in a multiprocessor, IFIP 71,
TA-4, 7-11.
Improved usage of hardware resources by having multiple
"skeleton" processors (interpreters) in time-divided rings.
Conditional fork ("as available") suggested. "Sub-commutation"
described to recover otherwise unused time slots. Rotating
priority among rings guarantees minimum access to physical
resources.
12. Flynn, M. J., MacLaren, M. D., Micro-programming revisited, ACM
NC (1967), 457-464.
A broad discussion of the whys and hows of micro-
programming. Good reading.
13. Flynn, M. J., Low, P. R., The IBM System/360 Model 91: some remarks
on system development, IBM J. R&D (Jan. 1967), 2-7.
14. Flynn, M. J., Podvin, A., Shared resource multiprocessing, Computer
(Mar/Apr 1972), 20-28.
Shared functional units include multiple instruction
decoding units.
15. Flynn, M. J., Podvin, A., Shimizu, K., A multiple instruction stream
processor with shared resources, in "Parallel processor systems,
technologies, and applications," ed. by L. C. Hobbs et al., Spartan
Books (1970).
16. Schwartz, J. , Large parallel computers, JACM 13, 1 (Jan. 1966), 25-32.
“Athene” type, multiple instruction location counter
machines.
17. Senzig, D. N., Observations on high performance machines, FJCC (1967),
791-799.
Shared functional units among multiple CPUs.
18. Tjaden, G. S. , Flynn, M. J. , Detection and parallel execution of
independent instructions, IEEE Trans. C-19, 10 (Oct. 1970), 889-895.
19. Tomasulo, R. M., An efficient algorithm for exploiting multiple arith-
metic units, IBM J. R&D (Jan. 1967), 25-33.
Common data bus in the floating-point unit.
20. Tucker, A. B., Flynn, M. J., Dynamic microprogramming: processor
organization and programming, CACM 14, 4 (Apr. 1971), 240-250.
"A dynamically microprogrammed processor is charac-
terized by a small (4k 64-bit word) read-write microstorage."
(Abstract) 50 nsec control, 500 nsec main fetch times.
3 examples of microcode: integer multiply, linear table search,
Fibonacci number generation. "These operations are primitive
in the sense that executing a microinstruction requires no
sequential interpretation of an 'op code' by the execution unit."
"Microstorage words are assigned addresses of 0 through 4095
and main storage words are assigned addresses of 4096 and
higher." 16-deep microsubroutines. ". . . judicious selection
and efficient microprogramming of a problem oriented macro
set merit extensive research."
The next group of papers covers a variety of opinion concerning computer
organization. We find support of Flynn's shared functional units in Allen and
Pearcy [21]. But a contrary stream of development supports the idea of
having complete, independent processors in anonymous pools. Conway [22]
proposed a system in which "processors have no identity of their own".
Gountanis and Viss [27] similarly advanced a "multiprocessor system . . .
(which) achieves complete processor anonymity by eliminating any processor
specialization for arithmetic processing, input/output command, and interrupt
handling." Curtin [23] also argued that "each computer should contain all the
conventional equipment for instruction execution. It should be possible to
program various computers to provide a wide variety of functions which
precludes a built-in division of labor among the identical computers of a logical
or arithmetic nature."
We find the idea of multiple asynchronous processors in many places,
including Curtin [23], Lehman [29], Rosenfeld [33], and Rosenfeld and
Driscoll [34]; these articles inspired us to undertake the simulation-by-
instruction-execution approach. As Lehman [29] said, "It is the insight and
understanding gained from the design of simulation experiments and the analysis
of their results that draws attention to specific details and difficulties."
The coupling of multiple processors with microprogramming appears in
Davis and Zucker [25], Lesser [31], and Stevens [35]. This leads us to ask
at what level of programming we should apply the processors. Gosden [26]
assumed that "programmers will specify parallelism only if it is easy and
straightforward to do so." Allen and Pearcy [21] said, "the tendency is for
programmers to avoid functions which are not easily grasped . . . . A pos-
sible solution is . . . direct implementation of many functions needed in higher
level languages." Curtin [23] adds, "For multiple computers, the success
or failure of a hardware organization may rest solely upon the ease of
programming. . . ."
For these reasons, we have chosen to try to elevate machine language
operations to form easily used constructs. Davies' [24] broad review of
microprogramming summarizes, "It is fairly clear that, whatever the opti-
mization [of instruction repertoire] methods might be, they should not
require that users actually microprogram nor even understand micropro-
gramming. Implementation should be through disciplined and well-controlled
combinations of architecture, language processor, and operating system
services, with the user's interface as straightforward and as far from the
detailed microcode level as possible . . . ." In particular, we do not look for
user-defined parallelism to give us a breakthrough in machine parallelism; as
Gosden claimed, for example, the FORK-JOIN constructs in high-level languages
"(do) not show great gains."
Regularity of structure in the hardware, from LSI considerations, is
very desirable. Allen and Pearcy [21] argue well for this, and claim, "A
desirable aim is to use the increase in logic content to reduce software
complexity as well as improving execution speed." Correlated with regularity
of structure is a need to keep the control from becoming too complex.
Gosden [26] says, "A simple control scheme is needed to provide straight-
forward scheduling." And further, "A good control scheme . . . must be simple
so that the overheads of implementing it do not seriously erode the advantages
gained. Second, it should be efficient with simple ad hoc scheduling algo-
rithms." Low overhead means fast allocation. Gunderson, Heimerdinger,
and Francis [28] presented a system in which ". . . the capability of per-
forming allocation and assignment functions rapidly is used to provide
increased flexibility by making the duration of processor assignments (the
length of segments) relatively short so that processors are frequently avail-
able for reassignment."
Amortization of allocation overhead over the length of the parallel paths
can become important. As Gosden [26] said, "major benefits will accrue only
if . . . either many parallel paths exist or, if the parallel paths are few, they
must be long." Conway's [22] system also envisioned possible long paths:
"In fact, external interrupts are unnecessary. Consider that an I/O instruc-
tion is simply a very lengthy one which may be executed in parallel with other
code . . . ."
On the other hand, Conway also leads us to the idea of multiple processors
working on one task: ". . . notice that the FORK-JOIN approach provides no
justification for distinguishing between parallelism within a program and
parallelism between programs." Conway was not talking here about micro-
code, but the concept still applies. The whole approach of Lehman [29] is of
multiple processors on one task. Gosden [26] also pointed out "savings in
internal system overheads when parallel activity on one task replaces parallel
activity on different tasks."
We have gone in the direction of providing our system with many processors
of low computational capacity. Mallach [32], in a guidance computer study,
said, ". . . it was found that a multiprocessor with many slow processors is
more efficient than one with a few fast processors." Gunderson, Heimerdinger,
and Francis [28] gave the converse argument: "In fact, the need for parallel
processing is more important in a system with relatively weak processing
modules in order that it be able to execute a high-priority program in a rea-
sonable length of time."
The only previous application we have found of a variable number of
processors to a single problem is in Lehman [29] and Rosenfeld [33].
Lehman argued, "Within a system in which a maximum of p processors are
available to a job, it is pointless to partition a job, at any one time, into more
than p tasks. It is, however, undesirable to guarantee a user that p pro-
cessors, or even more than one processor, will execute his program."
We also find in Gosden [26], "Processes generate parallel paths dynami-
cally one at a time. . . . Distribution of parallel paths among varying num-
bers of processors is an easy ad hoc process." Finally, shared code among
processes is mentioned by Conway [22]: ". . . the same section of code can
be executed simultaneously in two parallel paths."
21. Allen, M. W., Pearcy, T. , Developments in machine architecture,
Proc. Fourth Australian Computer Conference (1969), 227-230.
Good general motivation for multiple-microprocessor
approach, starting from LSI considerations. 24 references.
22. Conway, M. E., A multiprocessor computer design, FJCC (1963),
139-146.
23. Curtin, W. A., Multiple computer systems, in Alt, F. L. and Rubinoff, M.
(eds.), Advances in Computers, Vol. 4, Academic Press (1963), 245-302.
A good introduction to and motivation for multiple identical-
processor systems. States some criteria of "success", also
refines the definition of "multiple computer". Curtin's multiple
processor model does not provide for simple processor-
processor transfers, and the programming seems to bog
down due to this. Contains a programming example, plus
a survey of existing (in 1963) multiple computer systems.
Considerable explanation of possible bus structures.
Sample of excellent "motivation" section: "if a multiple
computer is worth building at all, it should at least solve
some class of problems within some specified time limit,
which limit would be exceeded by any single computer working
on the same task. Also . . . it should, at least some of the
time, be used to execute a single task which requires all or
most of the available equipment." "The question to what
extent a multiple computer utilizes its hardware more
efficiently is, however, secondary to the time factor."
24. Davies, P. M., Readings in microprogramming, IBM Systems J. 11,
1 (1972), 16-40.
An enlightening and clearly written tutorial, with
annotated references to all the essential articles through
1970. "The fact that a machine is running at maximum
storage data bandwidth does not necessarily imply that a
particular programmed task is being executed at the maxi-
mum rate that could be achieved . . . (varying) the organi-
zation and representation of . . . (the program) and its
associated data." (p. 27)
25. Davis, R. L., Zucker, S., Structure of a multiprocessor using micro-
programmable building blocks, SIGMICRO Newsletter (ACM) (Oct. 1971),
27-42.
Burroughs "Interpreter" described in fair detail, with
claims for multiple processor configurations. "The Switch
Interlock is the means for interconnecting virtually any
desired number of Interpreters into a unified system. . . ."
The microprogramming design techniques are very interesting.
26. Gosden, J. A., Explicit parallel processing description and control in
programs for multi- and uni-processor computers, FJCC 66, 651-660.
Good survey of the approaches to parallel processing.
Goes into most detail for the FORK-JOIN, and parallel
FOR constructs for high-level languages. Emphasis, in
this part on sharing re-entrant code.
27. Gountanis, R. J. and Viss, N. L., A method of processor selection
for interrupt handling in a multiprocessor system, Proc. IEEE 54, 12
(Dec. 1966), 1812-1819.
". . . the multiprocessor system is based on a pool of
anonymous processors." "The supervisory or executive
program is maintained in the shared storage (main memory)
and is executed by any or all processors as required within
the system." "In order for a system to degrade gracefully,
either its various units must be duplicated, . . . or the units
must be failproof. All three approaches are represented
in the system described here."
28. Gunderson, D. C., Heimerdinger, W. L., Francis, J. P., A multi-
processor with associative control, in "Prospects for simulation and
simulators of dynamic systems," 183-200 (see CR 13916), Shapiro, G.
and Rogers, M. (eds.), Spartan Books (1967).
(Dynamic control over processor assignments.)
Multiple processors in a conventional organization except
for the use of associative memories to aid in allocation of
processors, segmentation, etc. Gives argument for short
segments, causing frequent reallocation of processors.
29. Lehman, M., A survey of problems and preliminary results concerning
parallel processing and parallel processors, Proc. IEEE (Dec. 1966),
1889-1901.
“After an introduction which discusses the significance
of a trend to the design of parallel processing systems, the
paper describes some of the results obtained to date in a
project which aims to develop and evaluate a unified hardware-
software parallel processing computing system and the
techniques for its use.” (abstract)
30. Lehman, M., Serial mode operation and high-speed parallel processing,
IFIP (1965), Part 2, Spartan Books (Oct. 1966), 631-632.
"The relative efficiency of a multiprocessor operating
in a serial mode and a conventional processor operating in
parallel depends critically on the extent to which macro-
parallelism (multiprocessing) is achievable in the former and
microparallelism (instruction look-ahead) in the latter.
Only the accumulation of significant operating experience
can decide these questions."
31. Lesser, V. R., An introduction to the direct emulation of control
structures by a parallel microcomputer, IEEE Trans. C-20, 7
(Jul. 1971), 751-764.
A very general model of emulation and instruction
execution is developed within a context of multiple asyn-
chronous microprocessors as the underlying support.
32. Mallach, E. G., Analysis of a multiprocessor guidance computer,
Report T-515, Instrumentation Laboratory, M. I. T., Cambridge, Mass.
(June 1969) (thesis).
33. Rosenfeld, J. L., A case study in programming for parallel-processors,
CACM 12, 12 (Dec. 1969), 645-655.
Exhibits application of multiple processors using a
common memory to a network calculation (using a "chaotic
relaxation" method). Basic processor modeled is an abbre-
viated System/360 with some additional instructions: Fork
("Split"), Join ("Terminate"), Test and Set, Branch on Index
in Storage (cf. Dijkstra's V(x)). Interesting results are
presented on (1) problem run time, (2) storage interference,
(3) processor and storage utilization, and (4) a run-time-
related "quality factor". The work reported is carefully
delineated and the paper contains cogent comments relevant
to more general problems.
34. Rosenfeld, J. L., Driscoll, G. C. , Solution of the Dirichlet problem on
a simulated parallel processing system, IFIP (1968), Software 2,
Booklet C, 24-28.
"Four relaxation algorithms were programmed . . . on
an instruction-executing simulator for a parallel processing
system."
35. Stevens, S. L., A multiprocessed, microprogrammed processor con-
figuration with engineered processor pools, Technical report, Bell
Laboratories, Naperville, Illinois (1971), 5 pp.
A rather general proposal advocating both micropro-
gramming and multiple processor configurations.
The next group of papers is somewhat peripheral to this work. They
have to do with higher-level language implementation, parallelism in high-
level languages, and language constructs for the control of parallel processors.
36. Bjorner, D., Review 22,362, of "SYMBOL - a large experimental sys-
tem exploring major hardware replacement of software," by
W. R. Smith et al., SJCC (1971), 601-616, Computing Reviews
(Dec. 1971), 580-581.
". . . to get at high-level language machines one may
do better by implementing . . . the primitives best suited
for compile-time language translation and execution-time
language interpretation."
37. Firestone, R. M., Parallel programming: operational model and de-
tection of parallelism, Ph.D. thesis, New York University (May 1971).
". . . attention is focused on multiprocessor hardware
and programs which have portions executed concurrently. . . .
The degradation due to storage interference is analyzed . . .
in the context of a block-structured language with parallel
execution."
38. Iliffe, J. K., Basic machine principles, American Elsevier Publishing
Co. (1968), 86 pp.
39. Meggitt, J. E. , A character computer for high-level language inter-
pretation, IBM Systems J. 3, 1 (1964), 68-78.
"The design was simulated on standard equipment."
40. Opler, A., Procedure oriented language statements to facilitate parallel
processing, CACM 8 (May 1965), 306-307.
Demonstrates a plausible approach to parallelism in
high-level language source code - also shows the restric-
tions necessary for implementation.
41. Walden, D. C., A system for interprocess communication in a resource
sharing computer network, Proc. 4th Hawaii International Conference on
System Sciences (1971), 640-642.
Short discussion of generalized communication
primitives; possibly applicable to one-site multi-
processor systems.
The group of papers next below shows some of the applications of multiple-
processor systems.
42. Bauer, W. F., Why multi-computers?, Datamation 6 (Sep. 1962), 51-55.
Early discussion of the motivation for multiple pro-
cessor systems.
43. Bell, G., Freeman, P., et al., C.ai: A computing environment for
AI research - overview, PMS and operating system considerations,
Department of Computer Science, Carnegie-Mellon University, Pittsburgh,
Pa. (May 1971).
Proposal for a large, parallel machine for research
in artificial intelligence.
44. Dorff, E. K., A multiple minicomputer message switching system,
Computer Design 11, 4 (Apr. 1972), 67-73.
45. Langley, F. J., The universal function unit concept for computing
applications, Computer Design 11, 4 (Apr. 1972), 87-91.
"Single bus multicomputer system," Fig. 4, p. 89.
46. Miller, W. F., Aschenbrenner, R., The GUS multicomputer system,
IEEE Trans. EC-12 (Dec. 1963), 671-676.
“As many as seven independent computers or devices
are allowed to operate on the memories. . . . ” Application
to graphics, pattern recognition research.
47. Mutzeneek, J. I., Using mini-computers in systems engineering,
Electronic Engineering (Feb. 1971), 45-47.
“Separate processing units . . . under the control of
another mini-computer. ”
48. Rohrbacher, D. L. , Advanced computer organization study, final
report, Report RADC-TR-66-7, Vols. I and II, Rome Air Development
Center (Apr. 1966), AD 631870 (Vol. I), AD 631871 (Vol. II).
High-level investigation, with specific machines
designed for sorting and parallel compilation. Involved
in the project: J. Holland, K. E. Batcher, C. C. Foster,
as well as the author cited.
49. Wald, B. , Utilization of a multiprocessor in command and control,
Proc. IEEE 54, 12 (Dec. 1966), 1885-1888.
General discussion of the D-825 and some imple-
mentation problems and approaches.
The books cited below are good references for computer design in
general.
50. Bell, C. G., Newell, A., Computer structures: readings and examples,
McGraw-Hill (1971), 668 pp.
51. Buchholz, W. (ed.), Planning a computer system - project STRETCH,
McGraw-Hill (1962), 322 pp.
52. Husson, S. S., Microprogramming principles and practices, Prentice-
Hall (1970).
53. Lorin, H., Parallelism in hardware and software - real and apparent
concurrency, Prentice-Hall, Inc. (1972), 508 pp.
54. Sharpe, W. F., The economics of computers, Columbia University
Press (1969).
55. Thornton, J. E., Design of a computer - the Control Data 6600,
Scott, Foresman and Co. (1970), 181 pp.
A terse historical, motivational, organizational and
functional description of the 6600 by the man who "was
personally responsible for most of the detailed design."
Includes some circuit and logic design, especially of
arithmetic units.
The remaining group of papers, below, covers a variety of topics related
to multiple processor computer systems.
56. Baskin, H. B., Horowitz, E. B., Tennison, R. D., and Rittenhouse,
L. E., A modular computer sharing system, CACM 12, 10 (Oct. 1969),
551-559.
". . . a general purpose interactive multi-terminal
computing system is presented. . . . a bank of interchangeable
computers . . . are assigned to process terminal jobs as
they arrive." (abstract) Uses a crossbar switch, master
(control) processor.
57. Buchman, A. S., Aerospace computers, in Alt, F. L., and Rubinoff, M.
(eds.), Advances in Computers, Vol. 9, Academic Press (1968), 234-284.
Emphasis on reliability; error recovery; packaging.
58. Burnett, G. J., Koczela, L. J., Hokom, R. A., A distributed processing
system for general purpose computing, FJCC 67, 757-768.
Ideas for arrays of processors, a la Solomon, are
partially developed. "Natural" and "applied" parallelism.
Develops some approximate curves relating storage per
cell to parallelism and efficiency vs. parallelism.
59. Codd, E. F., Multiprogramming, in Alt, F. and Rubinoff, M. (eds.),
Advances in Computers, Vol. 3, Academic Press (1962), 77-153.
Multiple processing units, at high level.
60. Duggan, P. N., Multiple instruction stream microprocessor organization,
IBM Technical Disclosure Bulletin 14, 9 (Feb. 1972), 2739-2740.
"An instruction pipelining technique provides for execu-
tion of micro-instructions simultaneously from control
storages of four logical micro-processors. . . ."
61. Kartsev, M. A., On the structure of multiprocessor systems, IFIP 71,
Booklet TA-4 ("Hardware and Systems"), 1-6.
Classifies 4 types of multiprocessor systems: (1) asyn-
chronous, (2) synchronous with separate control, (3) common
control but separate instructions, (4) common instructions.
Proposes a combination of all four, using "bundles" of closely
connected, synchronous processors communicating
62. Koczela, L. J., The distributed processor organization, in Advances in
Computers, Vol. 9 (1968), 285-353.
Elaborate system of processors (in groups) which can
operate in concert or independently, at many levels of hier-
archy. Much attention paid to reconfigurability, I/O
structure, transition among operating modes. Intermodule
connections are described in detail. No example programs
are shown in this paper. It appears that the software may
be very complex.
63. Leiner, A. L., Notz, W. A., Smith, J. L., Marimont, R. B., Con-
currently operating computer systems, Proc. Int'l Conf. on Information
Processing, UNESCO, Paris, 15-20 June 1959 (R. Oldenbourg, München,
1960), 353-361.
Three-processor NBS "PILOT" system described.
Specialized processors, messages between them, handling
of 2 level storage system explained in considerable detail.
Some Boolean matrix techniques for detecting program step
dependencies explained.
64. Macnaughton, P. C. , Virtual multiprocessors, CIPS Magazine, 12-14,
20-24.
Classifies "multisystems"; primary objectives are
self-repair, reliability, expandability, modularity.
65. Macnaughton, P. C. , Lee, E. S. , The virtual multiprocessor - a new
computer architecture, IFIP (1968), Hardware 2, Booklet E, 39-43.
“Task managers, execution units, communication units,
and assignment units” are the four module types.
66. Miller, J. S., Lickly, D. J., Kosmala, A. L. , Saponaro, J. A. ,
Multiprocessor computer system study, final report (NASA-CR-108654)
(March 1970), 165 pp.
Excellent survey of 29 multiple processor computers.
Design study with emphasis on graceful degradation and
reliability for spacecraft application. Good review of
computer architectural considerations, in general and
specifically for reliability. Memory allocation, interrupts,
stacks, microprogramming all discussed in review (Ch. 3).
High-performance, reliable design schemata presented in
Ch. 5; uses buffer memory, microprograms in read-write
storage, stacks, relocation of memory pages; redundancy.
Appendix A presents survey of segmentation and paging on
7 existing machines.
67. Potvin, J. N., Chenevert, P., Smith, K. C., Boulton, P., Star-Ring:
a computer intercommunication and I/O system, IFIP 71, TA-4, 156-161.
Describes hardware used to interconnect several
computer systems and I/O devices, overcoming some
of the usual bus difficulties by constructing a high-speed
ring bus, with ports onto more normal busses. Cyclic
access to ring. A five port prototype was implemented.
68. Rice, R., LSI and computer system architecture, Computer Design 9,
12 (Dec. 1970), 57-64.
69. Riegel, E. W. , Parallelism exposure and exploitation in digital computing
systems, Report TR69-4, Burroughs Corp. (Apr. 1969), 255 pp. (thesis)
"Language constructs are suggested that provide explicit
indication of parallelism at the task level (routines and repeat
statements)." "Multiple computer systems are examined and
compared based on homogeneity and interunit communication."
Good survey and review.
70. Smith, P. A., Variable architecture computer, IBM Technical Dis-
closure Bulletin, 13, 9 (Feb. 1971), 2777-2778.
"A computer is illustrated having an architecture that
may be restructured by microprogramming." "The multi-
processing configuration . . . effectively (divides) the system
into two separate processors. . . ." "There is no restriction
as to which main store supplies microinstructions to the
control registers."
71. Stern, L., PDP-11 parallel multiprocessor, DECUSCOPE 9, 5
(Dec. 1971), 7-8.
Suggests how to hook up multiple PDP-11s using the
Unibus as a common bus.
72. Thompson, R. N. and Wilkinson, J. A., The D825 automatic operating
and scheduling program, SJCC 63, 41-49.
Parallel processing of independent tasks.
73. Ware, W. W., The ultimate computer, IEEE Spectrum (March 1972),
84-91.
"Conceptually, a multistream organization looks
attractive because it executes concurrently more than
one stream of operations, but experience with such
machines is limited. Practically, present multistream
designs are constrained to the concurrent execution of
the same operation, but on several strings of operands."
(p. 89)
Abbreviations used in the Bibliography:
SJCC Proc. AFIPS Spring Joint Computer
Conference
FJCC Proc. AFIPS Fall Joint Computer
Conference
IBM J. R&D IBM Journal of Research and Development
IEEE Trans. C- IEEE Transactions on Computers,
Volume C-
IFIP nn Proc. IFIP Congress (19nn)
ACM NC Proc. ACM National Conference
JACM Journal of the ACM
CACM Communications of the ACM
". . . multiprocessing organization has to this point seemed to have had little effect upon the architecture of the processors that exist in the system."
Harold Lorin, "Parallelism in Hardware and Software" (1972)
CHAPTER 2
COMPUTER HARDWARE DESCRIPTION
The Multiple-Processor Model
The choice of a specific hardware environment for trying out a multiple-
processor approach to computing is, to a great extent, a matter of taste. We
feel strongly, however, that it is worthwhile to choose a particular config-
uration which will impose restrictions of the type which would be encountered
if the system were implemented physically. Some of these restrictions will
be found to have no effect on the main results; others have possibly limited the
scope of validity of the results.
Following are some of the prejudices and presumptions which led to the
selection of the configuration described in this chapter.
1. The economics of integrated circuits demands maximum repeti-
tion of few parts. All processors should therefore be truly
identical and interchangeable. (An arithmetic capability was
rejected on this basis. )
2. Serious consideration should be given to keeping the number of
pins (wiring connections) on a processor to a minimum. Having
chosen to place the microprogram store external to the pro-
cessor, this led to the choice of a very narrow microinstruction
width - 8 bits.
3. Related to 2, above, no great emphasis should be placed on
minimizing the number of gates in a processor.
4. Allocation overhead is presumed to be low. In attempting to
make reasonable use of multiple processors within the exe-
cution of presumed machine-language instructions, it is clear
that processors must be attached and detached quickly. The
chosen model is able to allocate a free processor in a time
equal to two microinstruction cycles (for typical parameter
settings).
5. Correlated to 4, above, we expect and desire to have long
sequences of microinstructions. In any tradeoff, for example,
between a greater number of pins (required for addressing
capability) and longer microprograms, we have leaned
strongly toward the latter.
6. The microinstructions which request allocation and which
perform interprocessor communication are to be simple and
direct. The desire for simplicity (in both hardware and
programming) led to the restriction of a maximum of only
one “slave” processor allocated to a processor. This
restriction also leads to a desirable simplicity in the allo-
cator mechanism.
7. Timing of microinstructions should be uniform, even though
the separate processors are not synchronous. This led to
the omission of multiply and divide instructions, and to the
decision to avoid trap and interrupt mechanisms.
8. Since all processors are identical, we should make them
“anonymous” so that no dependence on physical processor
number can arise. For this reason, physical processor
numbers are not made available to the microprogrammer.
9. The choice of microinstructions, other than interprocessor
communication and allocation primitives, is not extremely
important. An intuitively arrived-at set of instructions was
modified only slightly during the course of development.
(The shift amounts were changed, and a store register was
substituted for Exclusive OR.)
Note that the allocation mechanism used in this organization does not
depend on having a bus type of communication path. The allocator could as
easily reside within a crossbar switch arrangement. The main feature of
the allocator from the programming point of view is that a processor exe-
cuting a FORK sends a single message to “processor zero”; the allocator
fills in the number of a destination processor, if available, and generates an
acknowledgment message.
This organization is particularly applicable to computation on problems
which can be structured to have relatively little global interaction. For ex-
ample, it is applicable to problems in which a chain of processors can
operate on data in pipeline fashion. Example 1 has made a bubble sort al-
gorithm into such a program, which we call a “sifting” sort. Other problems
for which it would work well are those with generally independent activity.
Example 3, a matrix multiply, is one such problem. Others are: some file
search algorithms, relaxation methods for partial differential equation
solution, top-down parsing algorithms, some pattern-recognition methods,
and simulation programs which have parallel processes. Certain input/ output
operations and other computer operating system functions could be handled by
independent processors.
This organization is not especially well suited to problems which have
complex data interaction. For example, the Fast Fourier Transform algo-
rithm would invoke a great deal of coordination overhead through main memory
due to the limited connectivity of the processors. Example 2, the Shell sort,
encounters this kind of difficulty, some of which was resolved by using some
of the data locations for coordination semaphores. In general, this organi-
zation relies to a high degree on the skill of the microprogrammer in finding
an appropriate structuring of the data and control mechanisms needed.
Some particular limitations of the model described in this chapter are
the following:
A. The choice of a common data bus to interconnect the processors and
main memory does not affect the programming of the algorithms, but it does
have some other consequences.
1. By varying the speed of the common bus, we have been able
to study the effects of varying bus capacity, but since all main
memory communications are transmitted across the bus, it
was not possible to reduce the memory access time below two
bus cycle times. Thus, for example, it was not possible to
simulate the pure case of “infinite” memory speed coupled
with a slow bus cycle time.
2. The speed of the common bus priority mechanism, which
scans the memory modules and processors, limits the
overall capacity of the bus. Although the scanning
mechanism is practical with 15 processors and four
memory modules, it would not be satisfactory for, say,
100 processors.
B. This model does not reflect any of the interference in microprogram
memory access which would occur if a common memory were actually imple-
mented. This deliberate simplification has been made to keep the artifacts
of microprogram access from influencing the other design studies. We feel
that a practical, LSI-implemented system would probably have separate
microprogram storage for each processor; this would avoid the access inter-
ference, but introduce the problem of microcode duplication.
C. No special input/output processors or any provision for input/output
is developed in this model.
Plausible extensions of the model would be:
1. Addition of one or more special processors which would have
connections to external devices or channels.
2. Separate ports to main memory which would give access to
I/O channels. All communication would be through main
memory.
3. Addition of control units which would be activated by addressing
special memory locations, in the manner of the DEC Unibus
[ “PDP-11 Handbook, ” Digital Equipment Corporation (1969)] .
Description of the Z Machine
The multiple processor system configuration chosen for study is dia-
grammed in Fig. 2-1.
The Z-machine has been described and simulated in the CISCO language.
The description is divided into four functional blocks (called “processes”
in CISCO): A processor module, a data bus, a main memory module, and a
microprogram store. Four main memory modules and fifteen processor
modules are used in the studies reported here. The bus contains the processor
[Figure: four main memory modules (M), the allocator (A) on the data bus, the
processors, and the microprogram store.]
FIG. 2-1. Z-machine configuration.
allocation mechanism (shown as block “A” in Fig. 2-1) and also contains
most of the statistics-gathering instrumentation and trace-generating state-
ments. Each processor connects to the microprogram store and to one port
of the data bus. A single microprogram store serves all of the processors.
The main memory modules are used as a four-way interleaved memory -
i.e., the low-order two bits of address determine which module is addressed.
Each of these functional blocks is described in further detail below.
The Processor
Each processor interprets microinstructions fetched from the micro-
program store, one at a time. A separate connection between each processor
and the microprogram store allows a processor to fetch a microinstruction
on each cycle of the store. Communication among processors and between a
processor and main memory is done via the bus. Requests for allocation
or deallocation of processors are also passed to the bus, which contains the
allocation mechanisms.
A local register named STATUS contains a value which indicates the
execution state of a processor:
STATUS = 0   The processor is unallocated and is waiting for a startup
             message from the allocator (“STOP” state).
STATUS = 1   The processor is waiting for permission to transmit to
             the bus (“BUS-WAIT” state).
STATUS = 2   Microinstructions are being actively interpreted
             (“RUN” state).
STATUS = 3   A “wait for input” instruction has been executed and input
             has not yet been received (“WAIT” state).
Logical interconnections between processors are made only by use of the
FORK instruction. When a FORK is executed by a processor, a request is
output to the allocator (transmitted on the bus). The request contains a starting
microprogram address. If an unallocated processor exists, it is allocated to
the requesting processor as its “slave”. The allocator sends a startup message
to the slave processor containing the starting address; then it sends an acknowl-
edgment message to the requesting processor (the “master”).
Each processor contains two local registers, named MASTER and SLAVE.
When a processor is allocated, its MASTER register is filled with the index
number (“name”) of the processor to which it is slave. The master processor
receives the name of its slave in the acknowledgment message, and places it in
the SLAVE register. In case of failure to allocate (i.e., no processor available),
an immediate acknowledge message is sent to the requestor, and its SLAVE
register is set (remains) equal to zero. All allocation requests are acted upon
immediately (during one or two bus cycles). Requests which fail to receive a
slave are not remembered by the allocator.
The contents of the MASTER and SLAVE registers may be tested (for =O
or #O) under microprogram control, but their contents may not be otherwise
accessed. Thus the anonymity of individual processors is preserved.
A processor may have at most one slave allocated to it. Its slave may in
turn have a slave, but this condition cannot be directly tested by the master.
Bus transmissions from a processor may be sent only to its master (if any)
or to its slave (if any).
The following are the microinstructions which deal with allocation and
interprocessor communication.
mnemonic code    description

FORK    request allocation of slave (starting address is in the accumulator)
STOP    deallocate self and wait for startup
DET     detach self from master (if any) and slave (if any), but continue
        processing
TSL     test if SLAVE = 0 (set condition code)
TMA     test if MASTER = 0
OSL     output data (from accumulator) to slave
OSLF    output data to slave, with flag
OMA     output data to master
OMAF    output data to master, with flag
WIN     wait for data input (from master, slave, or memory)
TIN     test source of input (set condition code to indicate master, slave,
        or memory)
TIF     test whether flag was received on last input (set condition code)
Main memory access is accomplished by the following memory-cycle-
initiating microinstructions:
MEMH    start memory cycle - destructive (“half-”) read
MEMF    start memory cycle - full read (“fetch”) and rewrite
MEMS    start memory cycle - store
MEMM    start memory cycle - read-modify-write (read and lock)
Each of these microinstructions causes a message to be sent to the
appropriate memory bank via the bus, taking the memory address from the
accumulator. The Store microinstruction (MEMS) would be followed by the
microinstruction to send a data word to memory:
OMEM output data to memory
The two read microinstructions would normally be followed by a wait-for-
input (WIN) instruction. The read-modify-write microinstruction (MEMM)
would normally be followed by both a WIN and an OMEM instruction. For both
the MEMS and MEMM cycles, the selected memory bank will wait until a data
word is received.
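These request sequences can be summarized in a few lines. The sketch below
(Python, for exposition only; the machine itself is described in CISCO) models
them against a hypothetical bus object whose send and wait_input operations
stand in for the microinstructions named above.

    # Hedged sketch of the memory-access sequences; "bus" and its methods
    # are hypothetical stand-ins for the microinstructions, not simulator code.

    def read(bus, addr):
        bus.send("MEMF", addr)         # start a full-read cycle (MEMH for half-read)
        return bus.wait_input()        # WIN: wait for the data word from memory

    def store(bus, addr, word):
        bus.send("MEMS", addr)         # start a store cycle
        bus.send("OMEM", word)         # the selected bank waits for this data word

    def read_modify_write(bus, addr, modify):
        bus.send("MEMM", addr)         # read and lock
        old = bus.wait_input()         # WIN: receive the original contents
        bus.send("OMEM", modify(old))  # return the modified word to memory
        return old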
The destination memory bank (module) number is determined by the two
low order bits of the address. These two bits are copied into the register
named BANK whenever a memory cycle is initiated by the processor. When an
output data to memory (OMEM) microinstruction is executed, the module
number is indicated by the value in BANK.
Each processor contains an accumulator (called A) plus a block of eight
registers, named R(0) through R(7). R(0) is dedicated to use as the micro-
program counter. The following eight microinstructions operate on these
registers.
AR      Add R(r) to A, result to A
LR      Load A from R(r)
INCR    Add one to R(r)
EXR     Exchange A with R(r)
SR      Subtract R(r) from A, result to A
NR      AND R(r) into A
OR      OR R(r) into A
STR     Store A into R(r)
The last four of these microinstructions are not permitted to operate
on R(0).
Associated with the accumulator are an overflow bit and a link bit. The over-
flow is set to 1 whenever an addition or subtraction operation involving the ac-
cumulator produces a result which cannot be stored in 16 bits (in signed two's
complement representation). The link bit is used in certain shift operations.
Shifting may be performed left (SHL) or right (SHR) in each of four modes:
logical (LN), arithmetic (AN), circular (CN), and circular with link (CL). The
shift amount (number of bits) of a single microinstruction may be 1, 3, 8, or
the number stored in R(1). A typical shift instruction, in assembler format,
would be
SHR LN,(1)     Shift right, logical [no link], one bit
The following miscellaneous microinstructions operate on A and the over-
flow and link bits :
INCA    add 1 to A
DECA    subtract 1 from A
-A      complement A
ALNK    add link bit to A
LNK0    set link to zero
LNK1    set link to one
-LNK    complement link
AOV     add overflow bit to A
The test microinstruction sets the condition code to a value depending on the
contents of the accumulator; the condition code is tested in the conditional skip
microinstruction. These cause 1, 2, 3, or 4 following microinstructions to
be skipped if the tested condition is satisfied.
SKn OV    skip n if overflow = 1        SKn NO    skip n if overflow = 0
SKn GT    skip n if A > 0               SKn LE    skip n if A ≤ 0
SKn LT    skip n if A < 0               SKn GE    skip n if A ≥ 0
SKn EQ    skip n if A = 0               SKn NE    skip n if A ≠ 0
The condition code is also set by microinstructions TMA, TSL, and TIF.
The appropriate conditional skips following these are SKn EQ or SKn NE.
After a TIN (test source of input), the conditional skips are
SKn MA      skip n if input from master (GT)
SKn SL      skip n if input from slave (LT)
SKn MEM     skip n if input from memory (EQ)
SKn ¬MA     skip n if input not from master (LE)
SKn ¬SL     skip n if input not from slave (GE)
SKn ¬MEM    skip n if input not from memory (NE)
The correspondence between these and the previous set of conditional skips
is shown in parentheses.
Two “immediate” operand microinstructions are provided to allow intro-
duction of data or addresses from the microinstruction stream. These are:
LI i Load A with immediate value
AI j Append j to value in A
The operand i is six bits wide, composed of a sign and 5 bits of magnitude.
The sign bit is copied to the left to fill the upper 11 bits of A, including its
sign bit. The operand j is five bits wide. When AI is executed, A is shifted
left (logical, no link) five bits, and the value j is placed in the lower 5 bits of
A.
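As an illustration of these semantics, the following toy model (not part of the
simulator) builds a 10-bit value in the 16-bit accumulator with one LI followed
by one AI; treating the LI operand field as sign-extended two's complement is
our reading of the description above, and is an assumption.

    MASK16 = 0xFFFF

    def LI(i):
        # 6-bit immediate (sign plus 5 bits of magnitude), sign-extended to 16
        # bits; interpreted here as two's complement (an assumption)
        assert -31 <= i <= 31
        return i & MASK16

    def AI(a, j):
        # shift A left 5 (logical, no link) and insert the 5-bit value j
        assert 0 <= j <= 31
        return ((a << 5) | j) & MASK16

    a = LI(0b01001)       # upper 5 bits of the value
    a = AI(a, 0b00011)    # lower 5 bits; a == 0x0123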
An assembly language variant of LI occurs, called LAI, “load A with imme-
diate address”. This is the same as LI, but the assembler will flag any value
of i which is negative or exceeds 5 (rather than 6) bits in width.
Communication between the bus and a processor is generally independent of
the microinstruction stream. When output to the bus is initiated by execution
of a FORK, OSL, OSLF, OMA, OMAF, OMEM, MEMH, MEMF, MEMS or MEMM
microinstruction, the processor does not wait for the output to be accepted by
the bus. Only if another output microinstruction (including STOP) is executed
before the bus accepts a previous communication will the processor go into a
bus-wait state (STATUS = 1). Similarly, input data communications are buffered
until a WIN (wait for input) microinstruction is executed. If a second input is
received before a WIN is executed, the first communication is lost, and a fault
condition is noted.
The microprogram store
The microprogram store, called MICROMEM in the simulator, is a single
block of storage, 8 bits wide by 1024 words long, with a mechanism for trans-
mitting one word at a time to each of the 15 processors. The microprogram
location counters of the processors are connected to registers in the micro-
program store. Registers in the microprogram store connect back to the in-
struction registers in the processors. Whenever a nonzero value appears in
one of the input registers of the microprogram store, the value is used as an
address of a word to be fetched, and the word is transmitted to the corres-
ponding instruction register on the same cycle. The input register is then set
to zero. Any number of the processors may be serviced on each cycle.
The main store module
Each of the four main store modules incorporates a block of storage 16 bits
wide by 1024 words long. Whenever a request is received, the module cycles
through four states as described below. The address of a word to be accessed
is received with the request; for “store” and “read-modify-write”, a second
input, containing the data to be written (or rewritten), must be received before
the cycle can be completed.
There are four modes, modeled after the possible modes of access of a
typical core memory:
H: “Half-cycle read”      The contents of the word addressed are sent to the
                          requestor. The word contains zero afterwards.
F: “Fetch”                The contents of the word addressed are sent to the
                          requestor. The contents are rewritten in the word,
                          leaving it unmodified.
S: “Store”                When the data item to be stored is received, it is
                          stored in the word addressed.
M: “Read-modify-write”    The contents of the word addressed are sent to the
                          requestor. When the data item to be stored is
                          received, it is stored in the word addressed.
The algorithm for the main store module is as follows:

Phase 0 (wait for request)        Sleep until a request is received.
Phase 1 (address clock-in delay)  Delay.
Phase 2 (destructive read)        Delay; buffer ← MEMORY[address];
                                  if H, F, or M, output contents of buffer to
                                  requestor; MEMORY[address] ← 0;
                                  if H, go to Phase 0.
Phase 3 (rewrite)                 Delay; if S or M, wait until data word is
                                  received (in buffer);
                                  MEMORY[address] ← buffer; go to Phase 0.
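The following compact sketch renders this four-phase algorithm in Python as a
synchronous toy (the simulated module is asynchronous and event-driven, and the
delays are elided); receive_data is a hypothetical callback standing in for the
arrival of an OMEM word from the bus.

    def memory_cycle(memory, mode, address, receive_data):
        # mode is one of 'H', 'F', 'S', 'M' (see the access modes above)
        # Phase 1: address clock-in delay (elided)
        # Phase 2: destructive read
        buffer = memory[address]
        output = buffer if mode in ('H', 'F', 'M') else None
        memory[address] = 0
        if mode == 'H':
            return output              # half-cycle read leaves the word zero
        # Phase 3: rewrite
        if mode in ('S', 'M'):
            buffer = receive_data()    # wait for the word to be (re)written
        memory[address] = buffer
        return output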
There are four main store modules in the system, named MEM0, MEM1,
MEM2, and MEM3. The bus directs requests for memory access to one of these
four based on the low-order two bits of the address transmitted. In the algorithm
above, the main store modules use the upper 10 of the 12 address bits received.
The four modules thus form a 4096 word 4-way interleaved memory.
Each module maintains a first-in-first-out queue of all transmissions re-
ceived. Transmissions are always accepted; when the module returns to Phase 0,
it checks the queue before “going to sleep”.
The queue allows access requests and data to be transmitted asynchronously
from all the processors. Each queue element contains information identifying
the source processor; this allows data (for example for a store operation) to be
located in the queue even though the arrival of an access request from another
processor may have intervened between the data and the store request. The
queue also permits the “modified” data of a Read-modify-write access to arrive
at the memory before the original data has been sent to the requesting processor.
The Bus and Allocator
The bus is the sole connection between the processors and main memory and
among the processors themselves. Integral with the bus mechanism in the sim-
ulation is the allocator, which determines the assignment of “slave” processors,
and maintains information on the status of processors.
The bus itself transmits 32 bits. The usage of these bits is identified in
Figure 2-2 below.
S | D | C | M | B | <-------- data -------->
S = source code (4 bits)
D = destination code (4 bits)
C = bus mode (2 bits)
M = memory mode (2 bits) (significant only when D = 0 and C = 3)
B = memory bank (2 bits) (significant only when D = 0 and C = 1 or 3)
Data length is 16 bits.
FIG. 2-2. Z-machine bus bit assignments
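A bit-level rendering of these fields is given below. The widths come from
Fig. 2-2; the particular bit positions are our assumption, since the figure fixes
only the field order (the two bits left over out of 32 are unused here).

    def pack(S, D, C, M, B, data):
        # S(4) | D(4) | C(2) | M(2) | B(2) | data(16)
        return ((S << 26) | (D << 22) | (C << 20) | (M << 18) | (B << 16)
                | (data & 0xFFFF))

    def unpack(word):
        return ((word >> 26) & 0xF,    # S: source code
                (word >> 22) & 0xF,    # D: destination code
                (word >> 20) & 0x3,    # C: bus mode
                (word >> 18) & 0x3,    # M: memory mode
                (word >> 16) & 0x3,    # B: memory bank
                word & 0xFFFF)         # 16-bit data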
In general, the source and destination codes correspond to the processor
number, when nonzero; the allocator and memory are destination zero. Bus
modes 0 and 2 are associated with the allocator, and modes 1 and 3 deal with
transmissions among processors and memory. Figure 2-3 diagrams the eight
possible types of bus transaction.
                                    Bus mode (C)
Destination (D)    0               1              2              3
zero               FORK            data to        DETACH/STOP    address to
                                   memory                        memory
nonzero            startup or      data to        Reset (due     data with flag
                   acknowledge     processor      to detach)     to processor

FIG. 2-3. Z-machine bus transactions
The meaning of these transactions is as follows:
Mode 0 (FORK, startup, and acknowledge). When a FORK micro-
instruction is executed by a processor, the FORK message (con-
taining a startup address) is sent to the allocator. If the FORK
succeeds, the destination is transformed by the allocator into the
number of a free processor, and is transmitted as a startup message.
The next bus cycle is then pre-empted for an acknowledge message
to the FORK requestor. The source code of an acknowledge message
is the code of the newly allocated processor. If the FORK fails, an
acknowledge message is generated immediately, with a source code
of zero.
Mode 2 (DETACH or STOP). When a DETACH or a STOP micro-
instruction is executed by a processor, a mode 2 message is sent to
the allocator, destination zero. These messages are distinguished
by the contents of the message. If the data bits are all zero, it is a
STOP, requesting that the processor be deallocated. Otherwise, the
contents of the MASTER and SLAVE registers of the detaching
processor are contained in the data bits. These are verified against
the duplicate master-slave information in the allocator, then one or
two Reset messages are generated by the allocator and sent to the
master (if any) and slave (if any) of the detaching processor. The
reset messages contain new master or slave values for the processors
receiving the messages. The detaching processor is not deallocated,
but is, in this manner, removed from the chain it was in.
The allocator also checks the chain connections of a processor be-
fore allowing a STOP to cause deallocation. If a STOP comes from a
processor which has a master or a slave attachment or both, the
STOP is transformed into a DETACH and acted on as above, followed
by deallocation.
Mode 1 (data). A nonzero destination indicates a simple data trans-
mission to a processor (caused by execution of an OSL or OMA micro-
instruction). Zero destination indicates a data transmission to memory
(caused by an OMEM microinstruction). The memory bank number is
given in the B bits.
Mode 3 (Data with flag). A nonzero destination causes transmission
of data “with flag” to a processor (caused by an OSLF or OMAF
microinstruction). Zero destination causes a memory-cycle-start
message to be sent to memory. The memory mode is determined by
the M bits; an address is contained in the data portion of the trans-
mission. The memory bank number is determined by the B bits
(which equal the lower 2 bits of the address).
Status is maintained in the allocator in three arrays, named BUSY, MASTER,
and SLAVE. Each array has one element per processor. BUSY[i] is a single
bit telling whether processor i is allocated (= 1) or not (= 0). MASTER[i] and
SLAVE[i] are four bits each, giving the indices of processors logically con-
nected to processor i in a chain.
On cold start, BUSY[1] is set to 1, and a startup message is sent to pro-
cessor 1, with the startup address 000.
The Allocator
The allocator is the arbiter of both bus access and of processor assignment.
Its algorithm is described below.
Algorithm A (Allocation of bus cycles and management of processor activity)
Bidirectional connections exist with each of four memory modules (num-
bered 0 to 3) and fifteen processors (numbered 1 to 15). Each connection may re-
quest access to the bus at any time. A request may consist of a request for trans-
mission of a message to one of the other connections (requiring one bus cycle),
or it may be a request for one of the control actions of the allocator (requiring at
least one cycle).
Three tables are maintained by the allocator to keep track of the states and
connectivity between processors. BUSY[1:15] is an array whose elements have
a value of 1 or 0, indicating the non-availability or availability, respectively, of
the correspondingly numbered processors. MASTER[0:15] and SLAVE[0:15] are
arrays which maintain the connectivity between processors allocated via the FORK
command.
Initially, BUSY[i] = 0 for 1 ≤ i ≤ 15, and MASTER[0] = 0 (always).
On the first cycle, BUSY[1] ← 1, SLAVE[0] ← 1, and a startup message
is sent to processor 1.
1. (Check for memory module access request) Scan around the four
memory connections, beginning with the one following the one which
last was granted access, until a request is found or all four have been
checked. If a request is found, place it in COPY and go to Step 3.1.
2. (Check for processor access request) Scan around the fifteen pro-
cessor connections, beginning with the one following the one which
last was granted access, until a request is found or all fifteen have
been checked. If a request is found, place it in COPY. If it is a con-
trol request, let j=number of the source processor, and go to step 4.
3.1 (transmit a message) Transmit the contents of COPY to the requested
destination connection.
3.2 Wait for one bus cycle time. Go to step 1.
4. (perform control action) If the request is a STOP, then go to step 5;
If the request is a DETACH, then go to step 6. Otherwise, go to
step 7.
5. (do STOP) Set BUSY[j] ← 0, where j = source processor number.
If SLAVE[j] = MASTER[j] = 0, then wait for one bus cycle and go
to step 1.
6. (do DETACH) If MASTER[j] ≠ 0, then send a detach message to
processor MASTER[j] and wait for one bus cycle;
SLAVE[MASTER[j]] ← SLAVE[j];
If SLAVE[j] ≠ 0, then send a detach message to processor
SLAVE[j] and wait for one bus cycle;
MASTER[SLAVE[j]] ← MASTER[j];
MASTER[j] ← 0;
SLAVE[j] ← 0;
If no wait has yet been performed in this execution of step 6,
then wait for one bus cycle.
Go to step 1.
7. (FORK request)
If BUSY[i] = 1 for 1 ≤ i ≤ 15, then send failure message to
processor j, wait for one bus cycle, and go to step 1.
Set i = smallest integer such that BUSY[i] = 0;
BUSY[i] ← 1;
MASTER[i] ← j;
SLAVE[j] ← i;
Send a startup message to processor i, wait for one bus cycle,
and go to step 1.
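The control actions of Algorithm A (steps 4 through 7) can be restated in a few
lines of Python. The sketch below keeps the three status arrays but omits the
round-robin bus scan and all bus-cycle timing; send is a hypothetical callback
standing in for a bus transmission.

    NPROC = 15
    BUSY   = [0] * (NPROC + 1)    # BUSY[1..15]; index 0 is "processor zero"
    MASTER = [0] * (NPROC + 1)
    SLAVE  = [0] * (NPROC + 1)

    def fork(j, start_addr, send):
        free = [i for i in range(1, NPROC + 1) if BUSY[i] == 0]
        if not free:
            send(j, "acknowledge", 0)           # failure: source code zero
            return
        i = min(free)                           # smallest i with BUSY[i] = 0
        BUSY[i], MASTER[i], SLAVE[j] = 1, j, i
        send(i, "startup", start_addr)
        send(j, "acknowledge", i)               # source names the new slave

    def detach(j, send):
        if MASTER[j]:
            send(MASTER[j], "reset", SLAVE[j])  # new slave for the master
            SLAVE[MASTER[j]] = SLAVE[j]
        if SLAVE[j]:
            send(SLAVE[j], "reset", MASTER[j])  # new master for the slave
            MASTER[SLAVE[j]] = MASTER[j]
        MASTER[j] = SLAVE[j] = 0                # j is pulled out of the chain

    def stop(j, send):
        if MASTER[j] or SLAVE[j]:               # a STOP with attachments
            detach(j, send)                     # implies a DETACH first
        BUSY[j] = 0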
“The engineering of large software systems has become too complex for a single package design to have a reasonable chance of success upon implementation. ”
H. C. Rats, 1971
CHAPTER 3
DESCRIPTION OF EXAMPLE PROGRAMS
Introduction
The three programming examples described here were chosen (1) to
illustrate algorithms which make use of the available multiple asynchronous
microprocessors in novel and interesting ways, (2) to exercise the multiple
microprocessor system in ways which will demonstrate its capabilities and
limitations due to organization, and (3) to demonstrate implementations of plausible
“higher-level” machine-language instructions.
The first two examples perform in-place sorting of a table of N numbers.
Example 1 uses a “bubble” or “sifting” sort algorithm, in which up to N-1 pro-
cessors are connected in a chain through which the numbers are passed. This
algorithm avoids main memory references at the expense of increased inter-
processor communication.
Example 2 performs a sort using the Shell (or “diminishing increment”)
method; increments equal to 2^k are used so that data dependencies are predictable.
This algorithm is “faster” as N becomes large, but depends on memory references
to a greater extent than example 1.
The third example, matrix multiply, performs a relatively large amount of
numeric computation.
The emphasis throughout these examples is to arrive at techniques for con-
quering the computational burden by using as many processors as possible,
without depending upon the availability of any specific number of processors
(greater than 1).
Expected Run Time Variation
In each example, the expected computation time depends in different ways
upon the size of the problem (N) and the number of processors available (k).
The computation time, of course, depends on the speeds of the system com-
ponents. Below, we develop the expected run times in terms of number of
main memory accesses, since this is the limiting factor in examples 2 and 3.
For example 1, the number of interprocessor data transfers and the number
of FORK commands executed are also taken into consideration.
In example 1, the sifting sort, the sorting process is an “N^2” process -
that is, the number of comparisons to be performed varies as the square of
the number of elements in the table. In the implementation here, the time to
compare two elements is small compared to the time required to fetch an
element from main memory, but the number of fetches nonetheless increases
with N^2. The total loading of the interprocessor bus is expected to be a
limitation also, since each data element is passed on the bus an average of N/2
times. In addition, the number of forks attempted is shown since this number
rises with the size of the problem as long as the number of processors avail-
able is less than N-1.
The variation of these three items with N and k is shown below, based on
the actual implementation.
N = number of data elements in table
k = number of processors available
number of memory accesses:   M = N^2/k + N
number of bus accesses:      B = N^2/2
number of FORKs attempted:   F = (N^2/k + N)/2
The number of bus accesses, B, does not include memory accesses, although
the address and data are, in fact, transmitted on the bus. The two tables be-
low illustrate the variation of those numbers with N and k. In Table 3-1a, we
set N=32 and show how the numbers increase for decreasing numbers of pro-
cessors available.
TABLE 3-1a
EXAMPLE 1: VARIATION OF PROBLEM CHARACTERISTICS
WITH NUMBER OF PROCESSORS

                        k, number of processors available
N = 32                 32     16      8      4      2      1
memory accesses, M     64     96    160    288    544   1056
bus accesses, B       512    512    512    512    512    512
forks attempted, F     32     48     80    144    272    528
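The table values follow directly from the three formulas above; a few lines of
Python (illustrative only) reproduce them:

    N = 32
    for k in (32, 16, 8, 4, 2, 1):
        M = N * N // k + N    # memory accesses
        B = N * N // 2        # bus accesses
        F = M // 2            # forks attempted, (N^2/k + N)/2
        print(f"k={k:2d}  M={M:5d}  B={B:4d}  F={F:4d}")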
We see in the table above that for small numbers of processors, the memory
accesses become the dominant factor, particularly if we assume that memory
access time is 4 to 10 times the bus cycle time. The number of forks attempted
also increases, adding to the computation overhead as the number of processors
decreases. This is the penalty we pay for attempting to maximize the usage of
all available processors. In this system, the fork overhead is not large, but
in a system with a slower allocation mechanism, it could become significant.
The table below shows the variation with N for a constant number of processors
(k=16). The last line of the table shows, for comparison, the variation of
Nlog2N. Typically, a good sort algorithm on a single processor would run in
some small multiple of this number.
TABLE 3-1b
EXAMPLE 1: VARIATION OF PROBLEM
CHARACTERISTICS WITH NUMBER OF DATA ELEMENTS

                       N, number of data elements in table
                       16     32     64    128     256      512    1,024
memory accesses, M     32     96    320  1,152   4,352   16,896   66,560
bus accesses, B       128    512  2,048  8,192  32,768  131,072  524,288
N·log2 N               64    160    384    896   2,048    4,608   10,240
Here we see that while M is increasing with N^2/k, B varies with N^2/2. There-
fore, we would expect that the bus should be capable of handling the load as long
as the ratio of memory cycle time to bus cycle time is greater than half the
number of processors. (That is, if execution time is limited by either M or B,
the load is “balanced” when M·TM = B·TB; or TM/TB = (N^2/2)/(N^2/k) = k/2.)
Thus in this case, a cycle ratio of 8 should support up to 16 processors
without bus saturation. These numbers, of course, do not account for the
spreading of the processing load over time due to the asynchronous nature of
the processors and the memory. This effect would tend to leave the bus un-
saturated for slightly higher loads than indicated.
Example 2, the Shell sort, spends most of its time executing a series of
insertion sorts. The expected number of interchanges required to order m
numbers using an insertion sort is developed in Knuth [Knuth, D. E., “The art of
computer programming, Vol. 3: Sorting and searching”, Addison-Wesley (1973),
Section 5.2.1, pp. 84-95]. The improvement due to the Shell sort relies on
the partially-ordered properties of the table after each insertion sort: The
number of insertions expected on each pass is decreased because of the sorting
that has been performed already.
Only the worst-case sorting time is considered here; this at least gives an
estimate of the expected improvement with multiple processors.
Only the number of memory references required in the example 2 program
is considered here since this is the dominant factor. Some of these references
are “read-modify-write”. An insertion sort of m items requires at most
m^2 + 2m - 2 memory references.
In example 2, we perform
1 insertion sort of N items
2 insertion sorts of N/2 items
4 insertion sorts of N/4 items
. . .
2^(r-2) insertion sorts of 4 items

where N = 2^r. The sorts are performed on the smaller number of items first,
and we have chosen not to perform the two-item sorts. Assuming the number
of memory references is m^2 for a sort of m items, and summing over these
sorts, we obtain

M = 2^(r-2)·4^2 + 2^(r-3)·8^2 + . . . + 1·N^2
  = 2^(r+2) + 2^(r+3) + . . . + 2^(2r)
  ≈ 2·2^(2r) = 2N^2
But now, with the parallelism obtainable with multiple processors, we can, in
principle, perform all of the first 2^(r-2) insertion sorts at the same time since
they operate on different data. After these are completed, the next 2^(r-3) sorts
may be performed in parallel, and so on. (In fact, the example 2 program
allows each of these “pass 2” sorts to go ahead as its data are released from
the two insertion sorts in “pass 1” which operated on them.) Assuming that
at least N/4 processors are available, the execution time in terms of memory
references becomes

M = 4^2 + 8^2 + . . . + 2^(2r)
  = 2^(2r) (2^(4-2r) + 2^(6-2r) + . . . + 1)
  = 2^(2r) (1 + 2^(-2) + 2^(-4) + . . .) ≈ (4/3)N^2
Thus, the maximum effect of parallelism based on this simple (worst-case
insertion sort run time) model is a reduction of run time by 33% (4/3·N^2 rather
than 2·N^2 above). This reduction is so small because the dominant term in
each case is the last one (N^2), which is due to the insertion sort of N items,
which cannot be speeded by parallelism. The Shell sort would tend to be
better than this (and would be speeded more by parallelism) because the N
item insertion sort starts with data which are already partially arranged.
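The two bounds are easy to check numerically. The sketch below counts m^2
references per m-item insertion sort, as in the model above, and compares the
serial total with the per-pass maximum that bounds the parallel case; the two
ratios approach 2 and 4/3 as r grows.

    def costs(r):
        # passes sort 4, 8, ..., 2^r items; 2^(r-p) sorts of 2^p items per pass
        serial   = sum(2 ** (r - p) * (2 ** p) ** 2 for p in range(2, r + 1))
        parallel = sum((2 ** p) ** 2 for p in range(2, r + 1))
        return serial, parallel

    for r in (5, 8, 12):
        N = 2 ** r
        s, p = costs(r)
        print(f"N={N:5d}  serial/N^2={s / N**2:.3f}  parallel/N^2={p / N**2:.3f}")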
In example 3, the matrix multiply, we find that the number of memory
references is 5N^3 + 11N^2 + O(N) for the one-processor case, where N = number
of rows and columns of the square matrices. In the implementation of this
example, processors are initiated as fast as their data offset addresses can
be calculated, so that we expect the run time in terms of memory references
to be approximately 5N^3/k, where k = number of processors available.
However, memory references are not the limiting factor in this calcu-
lation. The number of microinstructions required to compute a product is
approximately 150 (fewer, in case of a zero operand or overflow). Thus, the
matrix multiply could require up to 150·N^3 microinstructions, plus the time
for data access. For N=6, this yields 150·6^3 = 32,400 microinstructions to be
executed. Compare this with M = 5·6^3 + 11·6^2 = 1476 memory references.
Even if a memory reference requires 10 times as much time as a micro-
instruction execution, the microinstruction time dominates by a factor of 2
(32,400 vs. 14,760 instruction-times). Asymptotically for increasing N, the
microinstruction execution time would dominate by a factor of 150N^3/(10·5N^3)
= 3. Thus, we have a program which is computation-bound. Its execution
time varies with N^3 and inversely with k, the number of processors available.
The computation overhead does not increase with increasing number of pro-
cessors, so we expect the run time to approach the 1/k curve typical of com-
putation-bound processes, as long as the computation is not competing with
other processes for resources.
Table 3-2 summarizes some of the empirical characteristics of these
three examples. Example 2 is the heaviest user of main memory, primarily
because some of the actively-used address counters are stored in a stack in
main memory. Example 3 uses main memory in bursts - whenever a sum
accumulation is done and new operand addresses are calculated. Example 1
makes very few references to main memory except when the number of avail-
able processors is very small. Processor utilization is light except for the
“compute-bound” Example 3.
Each example has a total run time which depends on the size of the table
or matrix. However, Example 2 is the only one in which the values operated
on have a large influence on the execution time; the Shell sort time can be sig-
nificantly lowered because the insertion sorts are shorter when the data are
“well-conditioned”. Example 3 is slightly data-dependent - the multiplication
terminates early when the multiplier or the multiplicand is zero.
The actual size of the microcode used to implement each of the examples
is shown. Examples 2 and 3 use, in addition, a stack in main memory. In
Example 2, with N = 100, 500 words are used; in Example 3, with N = 6, 113
words are used.
The maximum theoretical number of processors is the number which could
be used if allocation were limited only by the algorithm. Example 1 could use
31 processors to sort a 32-length table. Example 2 could, in principle, use
16 processors during the first part of a Shell sort on a 100-item table. (In
practice, the first eight processors terminate so quickly that only eight pro-
cessors were ever active at one time in this study. ) Example 3 uses one pro-
cessor on each inner product, using 36 of them for the 6 by 6 case here. (In
practice, a maximum of 13 processors were active at one time in this study. )
TABLE 3-2
SUMMARY OF PROBLEM CHARACTERISTICS

                             Example 1       Example 2      Example 3
                             Sifting Sort    Shell Sort     Matrix Multiply
                             N = 32          N = 100        N = 6

Main memory usage            light           heavy          moderate
                                                            (heavy bursts)
Processor usage              light           light          heavy
Run time is
dimension-dependent          yes             yes            yes
Run time is
data-dependent               no              yes            yes (multiply only)
Storage requirements:
  microcode                  162 bytes       306 bytes      341 bytes
  main memory stack          none            5N words       3N^2 + 5 words
Maximum theoretical
no. of processors            N-1 (31)        16             N^2 (36)
                                             (practical: 8)
Processor Allocation and Deallocation
At the start, there is a pool of up to fifteen processors available to the al-
locator. These processors are allocated by means of a FORK microinstruction.
An unallocated processor is in a STOP state, waiting for a startup message from
the allocator (in the bus). The startup message gives the address of a starting
microinstruction.
For cold start, a FORK 0 startup message is generated autonomously by the
allocator, causing the first processor to begin executing at location zero. At this
point, we would diagram the operation of the one processor as below.

 ___
  P1

The symbol at the top indicates that P1 has no “master” attached to it - it is
operating at the top level. If now P1 executes a FORK microinstruction which
is successful, the resulting diagram would be

 ___
  P1
   |
  P2
Here, the connecting line between P1 and P2 indicates that they are in a
master-slave relationship: to send a data word to P2, P1 would execute an “out-
put to slave” microinstruction; similarly, P2 would execute an “output to master”
to send a word the other direction.
If P2 now executed a FORK which is successful, the result is

 ___
  P1
   |
  P2
   |
  P3
We note that P2 can detect the presence of the slave, P3, by executing a
“test slave” microinstruction, which sets a condition code based on whether a
slave is attached or not. Similarly, a processor can test whether it is operating
at the top level or not by executing a “test master”.
To remove itself from a chain, a processor may execute a DETACH micro-
instruction. If P2 executes a DETACH while in the configuration above, the
result is

 ___        ___
  P1         P2
   |
  P3
Thus, we now have two active chains. Note that the master of P3 has become P1;
the chain has been “pulled up”. Normally, P2 would now execute a STOP micro-
instruction which would put it back in the pool of unallocated processors. This
is not required, however - it is possible for P2 to continue processing. But it
can no longer communicate directly with Pl or P3.
Note also that a fault condition results if P1 executes a FORK while a slave is
attached to it. A DETACH is permissible at any time. If a STOP is executed while
a processor is attached to either a master or a slave, the DETACH operation is
first automatically generated by the allocator.
Example 1: Sifting Sort
A set of numbers is stored in consecutive locations of main memory, starting
at location I (initial) and ending at location F (final). The numbers are to be re-
arranged into non-decreasing order.
The algorithm here creates a chain of processors, each of which “sifts” the
data being passed to it, always keeping in a local register the smallest data item
it has seen. The “top” processor fetches the data from the table and after comparing
the first two, passes the larger to the second processor. After fetching and com-
paring all of the data, the top processor puts the remaining item, which is the
smallest, into the first table position. The second processor has had all but that
one item passed to it, so the one remaining is the second smallest, and it places
it in the second table position. Similarly for the rest of the chain of processors,
except that the bottom processor stores the last two items. Thus, up to N-1 pro-
cessors may be used to sort a table of N items.
What happens when fewer than N-1 processors are available? Since each data
item being handed down the chain vacates a position in the table, the table itself
may be used to buffer the partial results; the bottom processor, when it is not the
(N-l)th one, stores the data it would otherwise pass on. After the last item has
been stored (and all the processors above it in the chain have terminated), the
bottom processor becomes the top processor and it restarts the sort on the remaining
data.
The basic algorithm is as follows:
Algorithm B (Bubble, or sifting, sort)
N data items in memory, stored at location I through F, are sorted in place.
If N-1 processors are available, each data item is fetched only once. If only
K < N-1 processors are available, then the first K items are sorted on the first
pass (i.e., fetching each of the N items once), etc.
“Get” means either fetching a data item from memory or receiving one from
the processor above in the chain (the “master”); “Put” means either storing a
data item in memory or sending it to the next processor below in the chain (the
“slave”).
B1.  (Initialize) Get an item → A.
B2.  (Receive next item) Get an item → B.
B3.  (Compare and exchange) If A > B, then exchange A and B.
B4.  (Test for (local) end) If no more items, go to step B8.
B5.  (Check on chain-building) If a slave is already attached, go to step B7.
B6.  (Add to chain) Request allocation of a slave (slave to start at step B1).
B7.  (Send item) Put B; go to step B2.
B8.  (Finish with A and B) Put B; store A into table.
B9.  (Test for chain position) If a slave is attached, terminate.
B10. (Test for (global) end) If I am processor N-1, return to calling program
     (done); otherwise, increment my number and go to step B1.
The top processor always fetches data from memory. A processor which has
requested a slave, but did not receive one, stores all of its data into memory.
To coordinate passing of data from a master to a slave, a token (a data word)
is passed from slave to master each time the slave is ready to receive a new
data item.
The master distinguishes the token from inputs from memory and from its
master by using the “test input” instruction. (See the detailed flowchart in
Figure B-6, Appendix B.) Thus, step B7 above may be delayed while waiting
for a token from the slave, and step B2 may include sending a token to a master
and waiting for the next item to be sent in response.
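The data flow of Algorithm B is captured by the short sequential sketch below
(Python, illustrative only): each element of the list chain plays the role of one
processor's local register A, with FORKs, tokens, and memory traffic abstracted
away.

    def sifting_sort(table):
        chain = []                    # chain[p] = smallest item seen by processor p
        for item in table:            # the top processor fetches from memory
            for p in range(len(chain)):
                if chain[p] > item:   # compare and exchange; pass the larger on
                    chain[p], item = item, chain[p]
            chain.append(item)        # allocate a new "processor" for the leftover
        return chain                  # processor p stores table position p

    print(sifting_sort([5, 1, 4, 2, 3]))   # -> [1, 2, 3, 4, 5]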
Example 2: Shell Sort
The Shell Sort algorithm is a faster method for large tables than the
sifting sort used in Example 1. Basically, this method performs repeated
“insertion” sorts on subsets of the items in the table, working on successively
larger sets. The method gains speed over the sifting method because the sets
in successive passes are partially sorted. This method is described in detail
and analyzed in Knuth [Knuth, D. E., op. cit.], where it is called a “diminishing
increment sort”. We have chosen increments equal to 2^k in order to make the
process neatly parallel and to make the data dependencies orderly.
We define two parameters - a pass number, P, and an offset number, R.
The pass number varies from 1 to log2N, where N is the smallest power of
two larger than or equal to the length of the table. On each pass, each element
of the table is involved in one insertion sort. There are N/2^P of these inser-
tion sorts. The offset number, R, designates these sorts, varying from N/2^P
down to 1. R is called the “offset” because it also designates where in the
table the first element of the insertion sort is to be found. The elements in an
insertion sort for pass P, offset R are to be found at locations R, R+N/2^P, . . .
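Enumerating these sorts for a small table makes the scheme concrete. The
fragment below (illustrative) lists every (P, R) insertion sort of the full scheme
and the table locations it touches for N = 8; the output corresponds row-for-row
to Figure 3-4 further on (the implemented program actually omits pass 1, as
noted at the end of this example).

    import math

    N = 8
    for P in range(1, int(math.log2(N)) + 1):
        inc = N >> P                             # increment N/2^P
        for R in range(inc, 0, -1):              # offsets N/2^P down to 1
            locs = list(range(R, N + 1, inc))    # R, R + N/2^P, ...
            print(f"pass {P}, offset {R}: locations {locs}")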
The simplified flowchart in Figure 3-3 describes the procedure used to im-
plement this Shell sort algorithm. The first processor begins at START in the
flowchart, and initializes its offset number, R, to N/2. Then, entering the
“pass” loop, it initializes the pass number, P, to 1. Next, “Lock TABLE(R)”
is executed, meaning that the first element to be used in the insertion sort is
removed from the table and a special value is replaced in it. This prevents
another pass from starting prematurely. In Figure 3-4, the table locations
which are at one time or another “locked” are shown with boxed X's. Next,
entering the insertion sort loop, the processor checks to see whether another
processor could be utilized on this pass. If R = 1, another is not needed because

FIG. 3-3. Shell sort - simplified flowchart. (Legend: R = offset number;
P = pass number; the step shown is “Lock TABLE(R)”.)

this processor is doing the last insertion sort of the pass. In this case, it
skips immediately to the insertion sort, shown as SORT(P, R).
Otherwise, the processor executes a FORK microinstruction, which re-
quests allocation of another processor. The double lines indicate potential
startup of a second program branch, with the “slave” processor beginning at
the SLAVE START node. If a slave was allocated (the “master” receives an
acknowledgment if it was), the “master” sends initialization parameters to the
“slave”: an offset number equal to one less than that of the “master”, plus
addressing information. The “master” then goes on to perform the insertion
sort for that pass.
Each physical processor keeps its offset value, R, constant for as long as
it is allocated. It does, however, go on to perform insertion sorts in subsequent
passes, if needed. Observe, for example, in Figure 3-4 that once the first
three insertion sorts are complete (pass and offset equal to (1,4), (1,3), and
(1,2)), the fifth one (2,2) can be performed without waiting for the fourth one
(1,l). In the flowchart, then, we see the processor which does the (P, R) in-
sertion sort incrementing P and checking to see if the (P+1, R) sort is one which
needs to be done. This is shown as the comparison of the current P to Pmax,
where Pmax is equal to N/2R. If the comparison indicates that another pass is
needed for this value of R, the processor stays in the insertion sort loop. If a
slave has not yet been allocated, the processor will again go through the test to
see if one is needed, and re-request one if necessary.
When no more passes are needed for this value of R, the processor
“unlocks” the table element at TABLE(R), and tests to see whether it is needed
for any further computation. If R = 1, the Shell sort is finished. If R > 1, there
is more to be done; if a “slave” has already been allocated, the “slave” will
take care of further allocations, and this processor returns itself to the pool
of free processors. In the case that no slave has been allocated, this (only
available) processor takes on the role of the slave by decrementing its offset
number, R, and going to the beginning of the “pass” loop.
Pass  Offset              Table Location              Physical Processor
  P      R       1    2    3    4    5    6    7    8       Number
  1      4                     [X]                 X           1
  1      3                [X]                 X                2
  1      2           [X]                 X                     2
  1      1      [X]                 X                          1
  2      2           [X]       X         X         X           2
  2      1      [X]       X         X         X                1
  3      1      [X]  X    X    X    X    X    X    X           1

Figure 3-4. Shell sort - implementation strategy, N=8
Figure 3-4 illustrates the implementation method for a table of length 8.
The pass number and offset are listed at left, and an X is shown under each
table location which is involved in that insertion sort. If this sort were per-
formed with a pool of only two processors, then the physical processor per-
forming the insertion sort would be the one whose number appears in the right
hand column. The execution with two processors would be as follows: Pro-
cessor number 1 initializes P to 1 and R to 4 and “locks” location 4. This
means that it stores a special value into location 4 as a flag to prevent a later
pass from using it before the insertion sort is finished. The contents of the
“locked” location are kept local to the processor during the insertion sort.
Next, processor 1 executes a FORK, and starts up processor 2 on the sort for
P=l, R=3. Processor 1 then proceeds with the insertion sort. Meanwhile,
processor 2 attempts a FORK, but fails to receive a slave. It then proceeds
with its sort. When processor 1 completes its sort, it discovers (in the P:Pmax
comparison) that no more insertion sorts at offset 4 are required, and it, there-
fore, “unlocks” location 4 and terminates. Processor 2 then finishes similarly,
except that when it finds no slave attached, it goes on to perform the sort for
P=1, R=2. It also succeeds with its FORK this time, starting processor 1 on
the sort for P=1, R=1. (Note that now processor 1 is the slave of processor 2.)
When processor 2 finishes this time, it goes on immediately to the P=2,
R=2 sort. Location 2 in the table remains “locked”. Similarly, processor 1
goes on to the sort for P=2, R=l . Next, processor 2 finishes and terminates.
Processor 1 goes on to pass 3, offset 1, the final insertion sort. Note that in
general, the final sort may have to wait until the P=2, R=1 sort is finished; a
potential wait occurs whenever a boxed X occurs below another one in the figure.
See Appendix B for a detailed flowchart of the implementation.
Since the overhead of setting up an insertion sort is considerable, we have
chosen to eliminate pass 1 (increment = N/2) in the implemented algorithm so
that the initial increment size is N/4.
Example 3: Matrix Multiply
This algorithm performs matrix multiplication on two square matrices.
The implementation here performs fixed-point multiplication using shift, add
and test microinstructions. Each processor works on one of the N^2 required
inner-product calculations, accumulating the sum in the result matrix. Figure
3-5 shows a simplified flowchart of the matrix multiply as implemented. In
order to claim as many processors as are available, the bottom processor, if
it is not the last one required, requests a slave after every multiplication and
sum.
FIG. 3-5. Matrix multiply - simplified flowchart.
Figures 3-6 and 3-7 show the contents of memory, and of some of the
processor registers, for an example, with N=2 (2 x 2 matrices). Matrix A is
stored at locations 300 through 303, in row order. Matrix B is at 400 and C,
the result, is at 500. The final stack contents are shown in Figure 3-6. The
first processor is presumed to contain the three matrix base addresses and
the length, and to have the stack base address. As each processor is allocated,
the master processor computes the next vector base addresses (shown as AADR
and BADR), the indices (I and J), and the result element address (CADR). It
then passes these values along with the length (LEN) to the slave. The slave
then builds a stack area for temporary storage and continues by attempting to
allocate another processor. The end is recognized by the last processor when
I = J = LEN. See Appendix B for a detailed flowchart and the program listing.
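The address arithmetic that each master performs for its slave can be stated
compactly. The sketch below (illustrative; the decimal address constants of
Fig. 3-6 are used literally) generates, for each of the N^2 inner products, the
values that appear in the processor registers of Fig. 3-7; the row-major
allocation order is inferred from those figures for N = 2.

    def assignments(abase, bbase, cbase, n):
        # yields (I, J, AADR, BADR, CADR) in allocation order
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                aadr = abase + (i - 1) * n       # base of row I of A (row order)
                badr = bbase + (j - 1)           # base of column J of B
                cadr = cbase + (i - 1) * n + (j - 1)
                yield i, j, aadr, badr, cadr

    for row in assignments(300, 400, 500, 2):
        print(row)   # (1, 1, 300, 400, 500), (1, 2, 300, 401, 501), ...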
Location   Name       Initial Contents   Final Contents   Comment
600        FLAG                          0 or -1          set to -1 if over-
                                                          flow occurs
601        IBASE                         2
602        JBASE                         2
603        ABASE      300                300
604        BBASE      400                400
605        AADR(1)                       300
606        BADR(1)                       400
607        STKBASE                       600
608        AADR(2)                       301
609        BADR(2)                       401
60A        STKBASE                       600
60B        AADR(3)                       302
60C        BADR(3)                       402
60D        STKBASE                       600
60E        AADR(4)                       303
60F        BADR(4)                       403
610        STKBASE                       600

FIG. 3-6. Stack contents for matrix multiply, N=2
                        Processor
Register Name    P1     P2     P3     P4
STK              605    608    60B    60E
CADR             500    501    502    503
LEN              2      2      2      2
I                1      1      2      2
J                1      2      1      2
AADR             300    300    302    302
BADR             400    401    400    401

FIG. 3-7. Register contents for matrix multiply, N=2
“But to get things done, you must love the doing. . . . I’ll be glad if people who need it find a better manner of living in a house I designed. But that’s not the motive of my work. Nor my reason. Nor my reward.”
Ayn Rand, The Fountainhead, 1943
CHAPTER 4
MEASUREMENTS AND RESULTS
Measurements
For the purpose of determining efficiency and relative loads, a large number
of measurements are taken on each simulation run and are printed on the summary
sheet. A sample summary sheet is shown in Figure 4-1. The following terminol-
ogy applies in the descriptions of these measurements:
LENGTH   means the run time of an example, expressed in number of
         processor cycle-times.
WIDTH    refers to the number of processors executing concurrently,
         at each cycle.
AREA     is the total number of processor cycles allocated to the
         example. It is the measure of the “processing” resource
         received by the example, and thus not available to any
         other user of the system. The area is computed by

                   LENGTH
            AREA =   Σ    WIDTH(t)
                    t=1
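Given a trace of WIDTH(t), the length, area, and average allocated parallelism
of a run follow immediately; the toy computation below (not the simulator's own
instrumentation) shows the relationship.

    def measures(width):          # width[t] = processors executing at cycle t
        length = len(width)       # run time in processor cycle-times
        area = sum(width)         # processor cycles allocated to the example
        return length, area, area / length

    print(measures([1, 2, 3, 3, 2, 1]))   # -> (6, 12, 2.0)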
In the simulation, most of the measurements are gathered in the bus module.
Raw counts of the length and area are made in terms of the bus cycle time. When
the bus cycle time and the cycle time of the processors are not the same, these
numbers are multiplied by the ratio, to render the length in units of processor
cycle-times. These are shown as “length” and “adjusted area” on the summary.
FIG. 4-1. Sample summary sheet.
The total number of microinstructions executed by all processors, shown as
“total minstr” on the summary, is the sum of actual instruction counts kept by
each processor module.
We now have the materials for the first important measure: how many of the
processor cycles allocated to the problem were used to execute microinstructions?
This efficiency measure is given as “true minst/adj area” in the summary. The
number of processor cycles spent in waiting is shown as “measured wait count”,
and is also displayed as a percentage of the area. Other measures gathered are
the following:
Average microinstruction parallelism: The total number of microinstructions
executed, divided by the run time, gives the average “width” of parallelism.
The maximum width is also shown, as well as a histogram of the width versus
the number of cycles that the number of processors was active.
Average allocated parallelism: The total number of processor cycles avail-
able to the program, divided by the run time, gives the average parallelism
as seen by the allocator. This is the measure of processor resource demand
by a task as it would compete with other tasks in the system. The ratio of
this average to the microinstruction parallelism average is the same as the
microinstruction/area efficiency measure.
Bus traffic : The total number of bus communications, including control,
inter-processor, and processor-main memory transmissions, is tallied here.
Also displayed are the traffic per microinstruction, the traffic per run time,
and the percentage of total bus bandwidth used.
Memory references: The total number of main memory cycles initiated is
tallied, including fetch, half-read, store, and read-modify-write cycles. Al-
so displayed are the number of cycles per microinstruction, cycles per run
time, and percentage of maximum memory bandwidth used.
Memory mode count shows the count of each type of memory reference.
Memory bank count shows the number of references to each of the four main
memory modules.
Successful FORKs is the number of times the allocator provided a slave
processor to the task (including the cold start). FORKs attempted is the
number of times a slave was requested. The number of successful FORKs
is often characteristic of the problem: In Example 1 (sifting sort), it is
always equal to the table length minus one; in Example 3 (matrix multiply),
it is equal to the number of inner products to be performed (2).
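The relationships among these measures can be stated compactly. The following is a minimal sketch (the per-cycle width trace and microinstruction count below are made-up numbers for illustration, not simulator output) of how LENGTH, AREA, the two parallelism averages, and the efficiency measure fit together:

    # Illustration of the measures defined above. 'width' gives, for each
    # processor cycle of the run, the number of processors allocated to the example.
    width = [1, 2, 3, 4, 4, 4, 3, 2, 2, 1]     # made-up per-cycle allocation
    microinstructions = 20                      # made-up "total minstr"

    length = len(width)                         # run time, in processor cycle-times
    area = sum(width)                           # processor cycles allocated
    allocated_parallelism = area / length       # average parallelism seen by the allocator
    micro_parallelism = microinstructions / length
    efficiency = microinstructions / area       # the "minstr/adj area" measure

    # The ratio of the two parallelism averages equals the efficiency measure.
    assert abs(micro_parallelism / allocated_parallelism - efficiency) < 1e-12
    print(length, area, allocated_parallelism, micro_parallelism, efficiency)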
Using the measures just described, we answer the following questions in this
chapter.
1. How frequently, on the average, does a microprogram refer to main
memory? How frequently does it require bus communication? Are
these frequencies characteristic of the microprogram?
2. What bus capacity (bandwidth) is necessary to achieve adequate efficiency
for each example? Is there a value which satisfies all examples?
3. How important is main memory cycle time? Are there any examples
which are unusually dependent on memory speed?
4. How many processors can be used by each of these examples? What
limits the amount of parallelism achieved?
5. How much does run time decrease when increasing numbers of processors
are made available?
6. Assuming that the cost of a system goes up linearly with the number of
processors in it, what number of processors minimizes the cost per
throughput for these examples?
The three examples were deliberately chosen to have different characteristic
resource demands. Example I, the sifting sort, has relatively few references to
main memory, but uses many bus cycles for inter-processor communication.
Example 2, the Shell sort, depends heavily on main memory access, with rel-
atively little other bus communication. Example 3, matrix multiply, is the only
example to perform a large amount of computation.
Results
1. Bus and Memory Utilization
A not-too-surprising result, suggested by question 1 above, is that the aver-
age number of bus requests and the average number of main memory accesses
per microinstruction executed remains nearly constant for each example, over
a wide range of running conditions (see Table 4-2). The change of memory usage
with varying availability of processors in Example 1 shows clearly in this
measure, while in the other two examples it is clear that serialization of the com-
putation with few processors does not affect the average occurrence of memory
or bus accesses per microinstruction.
TABLE 4-2
AVERAGE UTILIZATION OF BUS AND MAIN MEMORY

                                Main Memory References      Bus Accesses
                                Per Microinstruction        Per Microinstruction
Example 1 - 15 processors               0.5%                       7%
Example 1 - 1 processor                 5%                        11%
Example 2                              10%                        20%
Example 3                               4%                        11%

NOTE: Bus accesses include address and data transmissions to main memory
and data transmissions to processors from main memory. "Main memory
references" is the number of fetch and store cycles initiated.
In the Z-machine, all memory access takes place by communication across
the common bus. Wherever memory access time is referred to in the measure-
ments to follow, the time includes two bus cycles (one for address transfer, the
second for data transfer). Memory rewrite time (when simulated, as for core
storage) is not included.
2. Bus Capacity
The general result on bus capacity, in answer to question 2 above, is that
for all three examples, a bus cycle time equal to the processor cycle time is
adequate to maintain good efficiency (microinstructions/area). In particular,
when the memory cycle time is not more than four times the processor cycle
time, a bus cycle time equal to the processor cycle time yields efficiency in all
examples of 74% or better. This efficiency is reflected correspondingly in the
run time.
Tables 4-3, 4-4, and 4-5 show the efficiency results of running the three
examples with varying bus and memory speeds. All speeds are expressed as
a ratio to processor cycle time. The results of these variations in run time
are presented graphically in Figures 4-6, 4-7, and 4-8. In the latter three
figures, the same result data are presented in two ways - run time versus mem-
ory speed for constant bus speed, and run time versus bus speed for constant
memory speed.
These results suggest that it is quite reasonable to operate with the bus
"saturated", i.e., operating near capacity. When the processors are occasion-
ally waiting for access to the bus, it indicates that the bus is being utilized
fully. On the other hand, the sharpness of run time variation with bus speed
indicates that bus capacity is a crucial resource.
TABLE 4-3
EXAMPLE 1: SIFTING SORT - EXECUTION EFFICIENCY
"EFFICIENCY" = % microinstructions/area

                    Bus Speed (fast to slow: 1/4, 1/2, 1, 2, 4)
Memory Speed
  fast    1        79
          2        79   78   76   64
          4        79   76
          7        79   78   76
  slow   10        36

Note: 15 processors used in all cases.
TABLE 4-4
EXAMPLE 2: SHELL SORT - EXECUTION EFFICIENCY
"EFFICIENCY" = % microinstructions/area

                    Bus Speed (fast to slow: 1/4, 1/2, 1, 2, 4)
Memory Speed
  fast    1        97
          2        91   91   80
          4        80   79   74   50
          7        66   65   50
  slow   10        55   56   48   28

Note: 8 processors used in all cases.

TABLE 4-5
EXAMPLE 3: MATRIX MULTIPLY - EXECUTION EFFICIENCY
"EFFICIENCY" = % microinstructions/area

                    Bus Speed (fast to slow: 1/4, 1/2, 1, 2, 4)
Memory Speed
  fast    1        98
          2        94   93   84
          4        86   85   81
          7        75   74   62
  slow   10        66   66   60   37

Note: number of processors used varied from 7 to 13.
FIG. 4-6. Example 1: sifting sort - bus speed dependence. (Run time versus bus cycle time, for several memory speeds; 15 processors used in all cases.)
FIG. 4-7. Example 2: Shell sort - bus and memory speed dependence. (Run time versus bus cycle time and versus memory cycle time; 8 processors used in all cases.)
FIG. 4-8. Example 3: matrix multiply - bus and memory speed dependence. (Run time versus bus cycle time and versus memory cycle time; number of processors used varied from 7 to 13.)
3. Main Memory Speed
We find that memory cycle time is quite important for efficiency in Examples
2 and 3 (but not in Example 1, where memory access was deliberately avoided).
A memory cycle time of seven times the processor cycle time yielded an efficiency
of 65% or better in all examples with bus cycle ratio of 1 or less. However, the
effect on run time is less dramatic. Example 1 (Figure 4-6) did not show any
significant variation of run time with memory speed. The sensitivity of run time
to memory speed is greatest, as may have been expected, for Example 2. Even
including this example, however, the memory speed dependence shows no sharp
features, and we conclude that the Z-machine organization has not introduced any
unusual sensitivity to main memory speed.
4. Amount of Parallelism
Table 4-9 shows typical parallelism and execution time results for the
three examples. Examples 1 and 3 were able to employ 8 to 9 processors on
the average, when 15 were available. Example 2 used less than three on the
average. We attribute this to the nature of the Shell sort algorithm as imple-
mented here - the last two passes in the sort can use only two and one pro-
cessors, respectively.
TABLE 4-9
TYPICAL PARALLELISM AND EXECUTION TIME

                                   Example 1-32      Example 2-100     Example 3-6
                                   Sifting Sort      Shell Sort        Matrix Multiply
                                   Bus = 1           Bus = 1           Bus = 1
                                   Memory = 2.75     Memory = 4        Memory = 7

Typical Average Parallelism
  "Allocated"                        8.8 (15)          2.7 (8)           7.8 (11)
  Microinstruction                   6.7               2.0               5.8

Typical Achieved Execution Time
  with maximum number
  of processors                      3,100 (15)       14,000 (8)         5,900 (11)
  with 1 processor                  24,300            32,900            45,000
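The speedups implied by Table 4-9 can be read off directly; the following minimal sketch simply divides the one-processor run time by the run time achieved with the maximum number of processors (values copied from the table):

    # Speedups implied by Table 4-9.
    runs = {
        "Example 1-32 (sifting sort)":   (24300, 3100, 15),
        "Example 2-100 (Shell sort)":    (32900, 14000, 8),
        "Example 3-6 (matrix multiply)": (45000, 5900, 11),
    }
    for name, (t1, tk, k) in runs.items():
        print(f"{name}: speedup {t1 / tk:.1f} with {k} processors")

The resulting speedups, roughly 7.8, 2.4, and 7.6, track the average allocated parallelism figures of 8.8, 2.7, and 7.8 in the table.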
Figures 4-10 through 4-15 present graphically the amount of parallelism
achieved on the average for the three examples. For each example, two sets
of results are shown - one for fast memory and bus speeds and the other for
slow memory and bus. By comparing the two, one can see the extent to which
the hardware parameters affect the parallelism. Since adequate bus capacity
has turned out to be crucial to effective use of the system, we have shown the
bus bandwidth utilization versus number of processors on the same scale in each
figure. Observe the correlation of bus saturation with a dramatic increase in
wait cycles. Also, observe that for Example 3 (Figures 4-14 and 4-15) the
maximum number of processors in operation at one time decreases from 12 to
7 as the bus and memory speeds are reduced.
The number of processors which can be usefully employed in each example
is dependent both on the hardware parameters and on the coding (the algorithm
implementation). The primary limitation of the Z-machine, in this respect, is
that only one processor can be allocated at a time. Furthermore, typically
each processor must then be loaded with enough state information to determine
whether another processor should be attached. For this reason, even the most
independently computable tasks, as in Example 3 (matrix multiply), cannot all
be started up at once; Example 3 was, therefore, unable ever to use more than
13 processors at a time while computing 36 independent inner products. On
the other hand, this skewing of the computation, causing a corresponding
leveling of demand for processors over time, would be an asset in a multi-
programmed system. (Further remarks on multiprogramming are made in
Chapter 5).
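The effect of this one-at-a-time allocation can be illustrated with a toy model. The startup and task lengths below are assumptions chosen only to show the mechanism, not measurements from the simulator:

    def peak_concurrency(n_tasks, n_processors, startup_cycles, task_cycles):
        # Tasks are handed out one FORK at a time, so task i cannot begin before
        # i * startup_cycles; each task then runs for task_cycles.
        n_started = min(n_tasks, n_processors)
        starts = [i * startup_cycles for i in range(n_started)]
        finishes = [s + task_cycles for s in starts]
        events = sorted([(t, +1) for t in starts] + [(t, -1) for t in finishes])
        active = peak = 0
        for _, delta in events:
            active += delta
            peak = max(peak, active)
        return peak

    # 36 inner products, 15 processors, an assumed 40 cycles of startup per FORK,
    # and an assumed 300 cycles per inner product: the peak never reaches all 15.
    print(peak_concurrency(36, 15, 40, 300))

Whenever the startup cost per FORK is a nontrivial fraction of the task length, the later processors start only as the earlier ones are finishing, which both caps the peak and levels the demand over time.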
FIG. 4-10. Example 1 - parallelism with fast bus and memory. (Average parallelism and bus bandwidth utilization versus number of processors.)
FIG. 4-11. Example 1 - parallelism with slow bus and memory. (Average parallelism and bus bandwidth utilization versus number of processors.)
FIG. 4-12. Example 2 - parallelism with fast bus and memory. (Shell sort, memory = 2 (fast); average parallelism and bus bandwidth utilization versus number of processors.)
FIG. 4-13. Example 2 - parallelism with slow bus and memory. (Shell sort; average parallelism and bus bandwidth utilization versus number of processors.)
FIG. 4-14. Example 3 - parallelism with medium bus and fast memory. (Matrix multiply, bus = 1 (medium), memory = 2.125 (fast); average parallelism and bus bandwidth utilization versus number of processors.)
FIG. 4-15. Example 3 - parallelism with slow bus and memory. (Matrix multiply; average parallelism and bus bandwidth utilization versus number of processors.)
5. Run Time Versus Number of Processors
Figures 4-16, 4-17, and 4-18 show how the run time of the three examples
varies with the number of processors available in the system. (For each example,
results are shown for the same pair of fast and slow bus and memory speeds as in
the previous six figures.) In each case, the run time for one processor is taken
as 1.00, and the rest of the run times are normalized so that the shapes of the
curves are commensurate.
We see again in Figure 4-16 that the characteristics of Example 1 are
relatively independent of machine parameters, and in Figure 4-17 that Example 2
is quite sensitive to them. It is also again clear here that Example 2 does not
make good use of more than 2 processors. We note also the apparent anomaly of
the 3-processor case in Figure 4-17, which is due to the requirement of the
Example 2 algorithm for four processors on the third-to-last pass, then two pro-
cessors on the next-to-last pass, and finally one processor on the last pass. The
third processor does not, by itself, add nearly as much to the computation as do
a third and fourth together.
Finally, we see the smooth variation of run time for Example 3 in Figure
4-18, but with variation with bus and memory speed of the maximum number of
useful processors.
The value of the decrease in run time with increasing numbers of processors,
expressed as cost per throughput, is investigated below.
FIG. 4-16. Example 1 - relative run time versus number of processors.

FIG. 4-17. Example 2 - relative run time versus number of processors.

FIG. 4-18. Example 3 - relative run time versus number of processors.

6. Cost-performance Analysis
Below, we develop a simple cost model for a multiple-identical-processor
computing system and analyze the cost per throughput to determine the optimum
number of processors for each example. In this model, the maximum number
of processors available in the system is regarded as the cost basis (the single-
task assumption). This still leaves open the possibility of further cost reduction
by multiprogramming.
We assume that the processors, being identical, all cost the same amount,
p. The cost of other system components - bus, memory, peripherals, etc. - is
to be absorbed in the cost, S, of the system without processors. The object of
the following analysis is to answer the question: Given that I intend to purchase
a computer system of the type described, which can accommodate a number of
processors ranging from 1 to a maximum, say Kmax, how many processors
should I buy? This question is answered here with figures for system cost divid-
ed by throughput, where throughput is naively defined as the inverse of job run
time.
    C = cost of system = S + k*p

    where  S = cost of non-processor components
           p = cost of a processor
           k = number of processors in system

    Normalized cost = C/S = 1 + (p/S)*k

    Run time with k processors = Tk
    Normalized run time = Tk/T1
    Normalized throughput = 1/(normalized run time) = T1/Tk

    Mk = (cost/throughput) with k processors = (1 + f*k) * (Tk/T1)

    where  f = p/S = processor cost / non-processor components cost

    M1 = cost/throughput with 1 processor = 1 + f

    Normalized cost/throughput = Mk/M1 = ((1 + f*k)/(1 + f)) * (Tk/T1)          (1)
For curves of constant cost (letting f vary), we set Mk/M1 = 1, which yields

    Tk/T1 = (1 + f)/(1 + k*f)                                                    (2)
Equation (2) above gives a family of constant-cost curves, which are shown
in Figure 4-19 for several values of f. This family of curves allows us to answer
the question: If a job runs with given (unit) speed with a single processor, how
fast does it have to run with k processors in order to justify purchasing those k
processors? The answer is given by locating on Figure 4-19 the point corres-
ponding to the ratio of run times and the number of processors. If the point lies
below the constant-cost curve corresponding to the marginal cost of a processor,
then the purchase decision would be made.
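As a concrete illustration of how equations (1) and (2) are used, the following minimal sketch evaluates the normalized cost per throughput for each candidate number of processors, applies the constant-cost test of equation (2), and reports the minimizing k. The run-time list and the value of f are placeholders, not measured data.

    # Normalized cost/throughput, equation (1): Mk/M1 = ((1 + f*k)/(1 + f)) * (Tk/T1).
    # 'run_times[k]' is the run time with k processors, normalized so that
    # run_times[1] is the one-processor run time. The values are placeholders.
    def cost_per_throughput(run_times, f):
        t1 = run_times[1]
        return {k: ((1 + f * k) / (1 + f)) * (tk / t1) for k, tk in run_times.items()}

    def justified(k, tk_over_t1, f):
        # Constant-cost test, equation (2): purchase k processors only if the job
        # runs fast enough that Tk/T1 falls below (1 + f)/(1 + k*f).
        return tk_over_t1 < (1 + f) / (1 + k * f)

    run_times = {1: 1.00, 2: 0.55, 4: 0.32, 8: 0.21, 15: 0.17}   # placeholder values
    f = 0.1                                                       # processor cost / S
    m = cost_per_throughput(run_times, f)
    best_k = min(m, key=m.get)
    print(m, "minimum at k =", best_k)
    print("k = 15 justified?", justified(15, run_times[15], f))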
Figures 4-20, 4-21, and 4-22 show the cost per throughput versus number of
processors for three values of assumed processor cost. The minimum cost points
are summarized below in Table 4-23.
TABLE 4-23
NUMBER OF PROCESSORS FOR MINIMUM COST/THROUGHPUT

Processor Cost Assumption      Example 1      Example 2      Example 3
        f = 0.025                  15              8             11
        f = 0.1                    15              4             11
        f = 0.25                    8              2              8
These results are for the "typical" settings of bus speed = 1 and a moderate
memory speed. We conclude that if the cost of a processor remains "small" - even
as much as 25% of the non-processor components cost - then these examples have
achieved economical usage of multiple processors.
-lOl-
u-l -
cu In
- d y
0 o
e-8
0
-
- d II - - co
* N
e N
0
co co
e N
0
- -
- d
d 6
d
3Wll
Nlltl
3AIlVl3~ x
IS03 W
31SAS
-103-
FIG. 4-21. Example 3 - cost curves. (System cost x relative run time versus number of processors.)
FIG. 4-22. Cost curves: system cost x relative run time versus number of processors.
"The prediction of future progress requires technological sophistication, some 'inside' information about the research currently in progress, and a large amount of courage."
William F. Sharpe, The Economics of Computers, 1969
CHAPTER 5
UNEXPECTED RESULTS AND POSSIBLE EXTENSIONS
Results From Execution Trace
In addition to providing a direct and effective means of detecting and cor-
recting errors in machine design and microprograms, the execution trace output
leads to observation of unexpected system behavior that would remain undetected
in an analytical or less detailed model.
Figures 5-1 and 5-2 are samples of the execution trace generated by the
Z-machine simulation code. (This trace output is constructed by user-written
simulation code.) Figure 5-1 shows the beginning of execution of a run of Example
1, the sifting sort. Note the data bus transactions - first the cold start FORK and
its acknowledge cycle, then the "half-read" cycle initiation transmission to
memory location 0000, and so on. In the right half of the figure, which is the
continuation, the first processor-generated FORK occurs, and we see processor
2 start up at microstore location 009 (underlined).
In Figure 5-2, a section of the trace of Example 3, matrix multiply, is shown.
At the top of this figure, the execution of seven of the active processors is traced.
Processor 10 (hex 'A') is also active, but not shown. Processor 1 has finished
one inner product and has stopped; in the middle of the figure, we see it being
attached again as "slave" to processor 10 (boxed). Note also the heavy but not
completely saturated usage of the bus.
FIG. 5-1. Sample of execution trace from Example 1.
FIG. 5-2. Sample of execution trace from Example 3.
Figure 5-3 shows a portion of the trace from Example 2, the Shell sort. In
this run, the bus cycle time is four times the processor cycle time, and the bus
is completely saturated. As a result, we see many waits for both bus and
memory access. (Due to the cycle ratio, only one of each four microinstructions
is printed.) In this trace, we discovered a phenomenon which had not been an-
ticipated. The peculiarly regular pattern of memory cycling is an undesirable
effect. The columns of “RD” here represent cycles during which a memory bank
has accepted a store request but has not yet received the data to be stored. The
regular pattern results from the coincidence of the four memory requests and
from the nature of the bus priority mechanism. Bus access is given to processors
on a cyclic basis: the processor which is granted access on one cycle has the lowest
priority for access on the next cycle. We see here that this causes the store and
data transmissions from any one processor to be separated whenever another
processor is waiting for access. Serious deterioration of memory utilization
could have resulted, in the Figure 5-3 trace, if more than these four processors
had been requesting bus access (not necessarily to memory). A solution would be
to promote the bus access priority, for one transmission, of a processor which
has transmitted a memory store or modify command.
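To make the arbitration behavior concrete, here is a minimal sketch of the cyclic priority rule together with the proposed modification. It is only an illustration of the policy as described, not the simulator's code; the request queues in the example are invented.

    # Cyclic ("round-robin") bus arbitration: the processor granted the bus on one
    # cycle becomes the lowest-priority requester on the next. 'promote_stores'
    # adds the proposed fix: a processor that has just transmitted a store or
    # modify command keeps top priority for its following data transmission.
    def arbitrate(pending, promote_stores=False):
        # 'pending' maps processor number -> list of transmissions still to send,
        # e.g. {0: ["store", "data"], 1: ["store", "data"]}.
        pending = {p: list(q) for p, q in pending.items()}
        order = sorted(pending)            # current priority order, highest first
        promoted = None                    # processor owed an immediate data cycle
        grants = []
        while any(pending.values()):
            requesters = [p for p in order if pending[p]]
            winner = promoted if (promote_stores and promoted in requesters) else requesters[0]
            kind = pending[winner].pop(0)
            grants.append((winner, kind))
            promoted = winner if kind in ("store", "modify") else None
            order.remove(winner)
            order.append(winner)           # granted processor drops to lowest priority
        return grants

    # Two processors both storing: without promotion their command and data cycles
    # interleave; with promotion each store's data follows its command at once.
    print(arbitrate({0: ["store", "data"], 1: ["store", "data"]}))
    print(arbitrate({0: ["store", "data"], 1: ["store", "data"]}, promote_stores=True))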
Processor-to-bus Buffering
The processor has a buffer register in which output to the bus is held while
waiting for access. How much time is gained by allowing the microprogram to
proceed while waiting for bus access? We have rerun Example 3, matrix multiply
(bus = 1, memory = 7), with this output buffering disabled. The results are com-
pared in Table 5-4 below.
FIG. 5-3. Sample of execution trace from Example 2.
TABLE 5-4
COMPARISON OF RUNS WITH AND WITHOUT BUS BUFFER (EXAMPLE 3)
(Run time versus number of processors.)
The change in run time indicated that bus output buffering is not crucial under
the run conditions of this example.
Program “overhead”
After the three examples here were coded, the following question was raised:
How many of the microinstructions executed would be unnecessary if the problem
were being run on a single processor? In other words, how much program over-
head has been introduced by coding the algorithms for multiple processors, with
all of the necessary coordination? To answer this question, we examined the
microcode and counted the microinstructions which are devoted to coordination
(or would be unnecessary if we knew that only one processor were available) and
multiplied by the number of times that these microinstructions are executed in
the particular example cases run. The results, expressed as a percentage of the
total number of microinstructions, are shown below in Table 5-5.
TABLE 5-5

                                     Example 1        Example 2        Example 3
                                     Sifting Sort     Shell Sort       Matrix Multiply
"Overhead" in One-Processor
Case (Microinstr.)                      51%              3.6%             11%
Example 1 overhead seems unexpectedly high until we consider that the algo-
rithm, the sifting sort through a chain of processors, was designed for a many-
processor situation; the algorithm is intended to reduce memory references by
using large numbers of processors. The one-processor case here reduces to the
traditional bubble sort. The microprogram for Example 1, furthermore, is highly
“folded”- there is only one main loop for the three states possible for a processor.
This loop, therefore, has many status tests for alternate subpaths. All of these
tests would become unnecessary in a one-processor case.
Example 2 overhead is quite low, reflecting the relatively small amount of
parallelism in the implementation of the algorithm. Example 3 overhead seems
reasonable. The possible overhead was, in fact, taken into consideration when
the decision was made to attempt a FORK after each multiply and sum step in the
inner product loop. (See the simplified flowchart, Fig. 3-4, and discussion in
Chapter 3). We conclude that the microprogrammer in general has control over
the coordination overhead in that he can trade off some of the many-processor
parallelism for efficiency with one processor.
Data Bus Design
We have seen that the common bus used in the Z-machine is a crucial resource.
It appears that the bus is the limiting factor in determining how many processors
can be adequately supported in the hardware organization. Several alternatives
are proposed in the “design alternatives” section below.
For large numbers of processors, say 100 to 1000, it is clear that some com-
pletely different approach is required. One direction of current research
(M. L. Schlumberger, private communication) into a balanced scheme of inter-
processor communication has suggested the following. Each processor would be
located at a node which has two incoming and two outgoing paths. A proposed inter-
connection scheme then guarantees that a message from any one node can reach any
other node in time log2N, where N is the number of nodes. The scheme includes a
straightforward address-transformation method of routing the messages. In ad-
dition, there are multiple paths between nodes, so that a failure in one link only
degrades the transmission time of some messages.
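One well-known interconnection with the properties described (two incoming and two outgoing paths per node, and a simple address-transformation routing that reaches any node in log2 N steps) is a de Bruijn-style shuffle network. The sketch below is only an illustration of that general idea; it is not the particular scheme referred to above, and the node count is arbitrary.

    def neighbors(node, n_bits):
        # Node x is linked to nodes (2x) mod N and (2x + 1) mod N, where N = 2**n_bits,
        # giving every node exactly two outgoing and two incoming links.
        mask = (1 << n_bits) - 1
        return ((node << 1) & mask, ((node << 1) | 1) & mask)

    def route(src, dst, n_bits):
        # Shift the destination address in, one bit per hop: at most log2(N) hops.
        mask = (1 << n_bits) - 1
        path = [src]
        node = src
        for i in range(n_bits - 1, -1, -1):
            bit = (dst >> i) & 1
            node = ((node << 1) | bit) & mask
            path.append(node)
        return path

    # With N = 16 nodes (4 address bits), any message arrives within 4 hops.
    print(route(0b0110, 0b1011, 4))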
Such a scheme obviates two of the major drawbacks of the common bus in the
Z-machine. The whole system is vulnerable to failure in the bus; and the common
bus approach will always impose a limit on the number of processors which can be
attached.
Hardware Design Alternatives
The following suggestions cover alternatives which were considered-and re-
jected for this study, and some further possibilities which should be investigated
in the development of a viable complete system.
1. As long as there are as few as fifteen processors in the system, a
crossbar switch may be economically justifiable for processor-to-
processor and processor-to-memory communication. The cost, however,
rises sharply with larger numbers of processors.
2. An obvious extension of bus capacity would be provided by using a
separate bus for processor-to-memory transactions.
3. The Z-machine processor microinstruction set does not contain any
microinstructions which are particularly helpful in decoding and
interpreting the “machine-language” instructions. For example, an
efficient means of performing a many-way branch would assist in
writing emulation routines. Also, instructions which use masked
portions of a register as an index would be helpful.
4. In coding the examples here, all variables which could not be held
in processor registers were stored in main memory. By using read-
write microprogram storage and adding a “store local” microinstruction,
we could use this “local” storage for such temporary variables. By
introducing writable microprogram store, we open up many problems
of microprogram protection, but these might be worth facing in return
for a possible speed advantage.
Software Alternatives and Extensions
The following ideas relate to extending the parallelism we have achieved to
higher programming "levels".
1. The concept of multiprogramming in the conventional sense remains
valid in the context of the Z-machine organization. Competition for
processors at the lowest level, while it may not be under the control
of an operating-system-level scheduler, should not prevent successful
utilization of otherwise idle resources. In the economic analysis in
Chapter 4, the cost of all processors available, even when unused,
was charged to the "user" program. Multiprogramming could, there-
fore, improve the resulting cost figures by recovering some of the un-
used cycles. As a direct extension of the work presented here, we
would be interested to run two or more of the microprograms con-
currently, observe the competition for resources, and watch for
potential dominance of one microprogram over another.
2. No provision in this model has been made for input/output processing.
It should be possible, by a straightforward extension of the Z-machine micro-
processor, to create an I/O processor having communication
paths to external devices. This and other approaches to I/O handling
should be explored.
3. A serious challenge in developing a viable multiprocessor system is to
have a well integrated operating system. We have briefly in the past
(Levy, J. V., "Modules for Operating Systems and Computer Organization",
Computation Group Technical Memo, Stanford Linear Accelerator Center,
August 19, 1970) worked on developing some of the primitive operations
which would handle the queuing tasks of an operating system. Further
development of such operations in the context of a multiple microprocessor
system would be of value.
4. Once the interpretation cycle for some machine language has been
written in Z-machine microcode, there are extensions that can be
considered. One intriguing possibility is to introduce some (machine-
language) instruction overlap by using more than one processor in the
fetch-decode portion of the fetch-decode-execute cycle. To our know-
ledge, this would be the first instance of introducing instruction over-
lap and parallelism solely by microprogramming (i. e. , without modifying
the hardware).
5. High-level language parallelism by means of FORK and JOIN statements
or their equivalent has been proposed in many places. Implementation
of any well-defined set of corresponding machine language instructions
would be straightforward using the microcode level FORK, DETACH,
and STOP instructions. It would also be possible to define a variety of
machine language constructs which would give the programmer control
of the microprocessors directly. With this control, however, would come
the responsibility for assuring that the programs are deadlock-free.
6. The problem of performing a computation under real-time constraints
becomes more difficult with this multiple microprocessor model. Devel-
opment of hardware or software modifications to guarantee allocation of
processors or to give priority to certain programs would be worthwhile.
APPENDIX A
The CISCO Simulation System
The CISCO simulation system (J. V. Levy, “A Simulation System for Computer
Design”, Computation Group Technical Memo 113, Stanford Linear Accelerator
Center, January, 1971) is composed of three software packages, each written in
PL/I.
The CISCO compiler (Jim Low, "The CISCO Compiler", Report for Computer
Science 239, Stanford University, April 1, 1971; and Jim Low, "CISCO Language
Reference Manual", Computation Group Technical Memo 114, Stanford Linear
Accelerator Center, December 1970) accepts a description of a machine to be
simulated, written in a PL/I-like high-level language. Its output is loader text
for the KID interpreter.
The CISCO assembler (J. V. Levy, “A One-pass Assembler for CISCO”,
Computation Group Technical Memo 135, Stanford Linear Accelerator Center,
July, 1972) takes "opdef"-type instruction format specifications for a simulated
machine and assembles code for that machine. The output is initialization data for
memory in the simulated machine.
The KID interpreter (J. V. Levy, "KID - Interpreter and Loader Reference
Manual”, Computation Group Technical Memo 115, Stanford Linear Accelerator
Center, November, 1970) loads the machine description and initialization data and
then runs the simulation by interpreting the low-level code produced by the compiler.
The low-level operation of the interpreter is essentially a stack machine which
executes a series of functional blocks as co-routines.
APPENDIX B
Detailed Flowcharts, Data Values, and Assembler Listings
FIG. B-l. Z-machine definitions for assembler.
-iia-
Lh *E; c.N CL II, 13, (8, (81, * f UC * *
t * I
l R”S =TIK
0000 “00000002 *
*MEMO =TIK,
0000 “00000002 iT,KZ
0000 “0”000”08
HASTER SLAVE “EMORY NC, MASTER NOT SLAVE NOT HEHOR”
TIClE CCFlTHCL SETTIffiS
PROC ME co REG TIKI UC 2 MEMTiHE, RE‘ TI62 0‘ 8 MEMTlH.52
87. 88. 89. 90. 91. 92. 93. wt. 95. 96. 97. 98. 99.
LOO. 101. 102. 103. 104. 105. *ix. LOT. 108. 109. 110. 111.
FIG. B-l. (continued).
-119-
l * * * I *
*
*!4EMD -*EMDRY
DDOD “DDDDDDZD ODD1 “DDDDDDlC ODD2 “ODDDDOlb DO03 “OOODDO,4
5004 “ODDDDD1D 0005 “DDDOODDC 0o”b “DDODDDDB
DOD7 “oDDoDDD* l b!EMl =“EM”RY
DDDC “DOODDD,F DOD1 “ODDDDDLB 0002 "DDDDDDLT ODD3 "DDDDDDL3 ODD4 "DDDDODOF 0005 "DDOOODDR DDDb "OODDODD, ODD7 YDDODDDD3
*HEM2 =t4EMDR"
OODC "DODOODLE DDDL YDDDOODLA oooi! "DOODDDLb ODD3 "000000L2 DOD4 "DDDDDDDE 0005 "DOODDDO~ ODD6 "DDDDDOOb DOD7 "DDDDDDDZ
*HEM3 =ME”ORY
ODDC “DDDDDDlO DDO, “DDDDDD19 DOD2 YDDDDOD15 0003 “DDD”DDLl DOD4 “DODOODDD 0005 “DDDDODO9 DO06 “ODDDDDDS DO07 “ODODDDO,
l *
*MICROHEPI =STDRE
* l
*
*
R
PROC MEnO REG NEnORY DC 32 DC 28 DC 24 DC 20 DC I6 DC 12 DC 6 DC Q PRDC MEtI1 REG MEnDRY DC 31 DC 27 DC 23 DC 19 DC 15 DC 11 DC 7 DC 3 PROC nEn2 REG "EMORY DC 30 DC 26 DC *2 “C LB DC 14 DC 10 DC 6 DC 2 PRDC f4EY3 REG MEWXY OC 29 DC 25 DC 21 DC 17 DC 13 DC 9 DC 5 DC 1
PROC REG
196. 197. 198. 199. 200. 201. 202. 203. 204. 205. 206. 207. .?Ob. 209. 210. 211. 212. 213. 2,4. 215. 2L6. 217. 2LB. 219. 220. 221. 222. 223. 224. 225. 226. 227. t28. 229. 230. 231. 232. 233. 234. 235. 236. 231. 238. 239. 240. 241. 242. 243. 214. 245. 246. 247. 248. 249. 250. 25,. ncav4
FIG. B-2. Example 1 - assembler listing.
-120-
252 253 255 255 256 25, 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 27, 278 2,9 280 281 282 283 284 285 286 28, 288 289 290 291 292 293 2% 295 296 25, 298 299 300 301 302 303 304 305 306 307
OODC DO01 “002 0003
0004 0005 ODD6 0007 0000
0009 OOOd 0006 oooc 0000 DOOE OOOF OOlC 0011 ODLZ OOL3 0014 0015 0016 0017 OOl8
“Doooooco “00000078 “0000000F “0000007c
“oooooocl “00000079 “ooooooco “oooooo*F “00000058
“0000000, “00000079 “0000000, “00000078 “0000000 L “0000007c “00000048 “00000070 “ooooooco voooooo7A “DODODO49 “000000,0 “00000020 “0000004D “00D0000‘ “0000000,
l * l
l l
SL”
“Il. 2 S-1 STORING IN “EIIORY, -0 TO SLAVE “IL 3 INITIAL ADDRESS, ALSO INSERT AODRESS “AL 4 FIN&L IDDRESS “AL 5 RUNNING ADDRESS CDUNTER “AL 6 DPERlNO 1 V&L 7 OPERlND 2 IALVAYS CAUSE M-01
FORYARD REFERENCES
“AL DOF “AL OZE V&L 076 “AL OLD VAL ObC VAL OBF “*L 042 “AL 037 V&L OZA “AL 044 “AL OK “At. 096 VAL 009
SLbVE STARTS HERE R SET R TO VALVE OF 5 FROM MASTER
I NY ,NSERT ADDRESS
F FINII. LODRESS I G lN,T,lL,ZE THE RUNNING ADDRESS 0 5 5 SE, TO 0, ND STORES DONE YE, R
EP G R-.-O. FETCH FRDH “EMORY _..
MEW YIN WAIT FOR INPUT FROM MASTER OR HEMDRY
252. 253. 254. 255. 256. 257. 258. 259. 260. 261. 262. 263. 264. 265. 26.5. 26,. 268. 269. 2,o. 27,. 272. 273. 274. 2,s. 216. 27,. 278. 279. 280. 28,. 282. 283. 284. 285. 286. 28,. 288. 289. 290. 291. 292. 293. 294. 295. 296. 29,. 298. 299. 300. 30,. 302. 303. 304. 305. 306. 307. 106127
FIG. B-2 (continued).
-121-
308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 33-l 338 339 340 341 342 343 344 345 346 347 340 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363
0019 V0000007E STR OOlA VOOoOOocl LA1 0010 VOGOOOOAE AI OOIC voooooo58 R
0010 V0000004A OOlE v00000010 OOlF VOOOOOO36 0020 vooooooc 1 002 1 VOOOOOOAA 0022 voooooo58
0023 0024 0025 0026 0027 002a 0029 OOZA OOZH ooze 002rJ 502E OOZF 0030 0031 0032 0033 0034
voooooo2c v00000001 VOOOOOO4F VOOOOOOOA VOOOOOOFF VOOllOOO7A voooooo37 V00000040 VOOOOOOOE VOOOOOO4F voooooooB voooooo55 voooooo49 v00000010 VOOOOOOZE vooooooc2 VOOOOOOAZ voooooo58
SKI WIN LR OSL LI STR SK4 LR
#EMS LR DHEM INCR LH TEST SK3 LA1 AI a
0035 vooooooco 0036 vooooooo9 0037 v00000001 0038 VOOOOOOlc 0039 v00000038 0033 V0000007A 0038 v000000c1 003c vooooooB7 0030 voooooo58
+
LB
*
LI OMA WIN TIN SK4 STR LA1 AI B
003E VOOOOOC7F STR 003F vooooooc2 LA1 004c VCOOOOOAC AI 0041 voooooo58 B
0042 v00000040 0043 voooooooc 0044 v00000001 0045 v0000001c CO46 V0000003B 0047 V0000007A 004.9 vooooooc2 0049 VOOOOOOA4
EXAMPLE 1 51 FT ING SORT (14 APR 72) CISCO ASSEMBLER
+ + L4
c
L9
L2
* l
L7
LlO
LR TEST SK3 LA1 AI B
LR HEHH WIN 1 IN SK4 STH LAI AI
d L2115 L2
S
LE L9//5 L9
EP
B
-1 S LE G
B
G R
EP L7//5 L7
0
-SL 5 LB//5 LB
B L11//5 Lll
G
-.SL 5 L LO//S LIO
FIRST OPERAND
GO STAF.T UP
S>O, GO STORE R IN MEMORY
308. 309. 310. 311. 312. 313. 314. 315. 316. 317. 318. 319. 320.
IF S<O, WAIT FOR INPUT FROM SLAVE s=o : SEND B TO SLAVE
321. 322. 323. 324. 325.
SET 5 TO -1, INDICATES WAlTlNG FOR SLAVE IN 326. IALWAYS SKIPS) 327. S-0; STORE B AT G 32R.
HOVE UP THE ADDRESS CflUNTER
R-.-O, GO FETCH FROM MEMORY
R=Ot SEND SIGNAL TO MASTER
WAIT FOR INPUT IFROH MASTER1
FROM MASTER? NO: FROM SLAVE: SET S WAIT FOR MASTER
NEXT DATUM GO TO TFST AND COMPARE
FETCH FROM MEMORY
FROM MEMORY?
NO, FRflM SLAVE: SET S WAIT FOR MEMORY DATUM
329. 330. 331. 332. 333. 334. 335. 336. 337. 338. 339. 340. 341. 342. 343. 344. 345. 346. 347. 348. 349. 350. 351. 352. 353. 354. 355. 356. 357. 358. 359. 360. 361. 362. 363.
FIG. B-2 (continued).
‘- 122-
EXAMPLE 1 51 FTING SORT I14 APR 721 CISCO ASSEMBLER
364 365 366 361 368 369 370 371 372 373 374 315 376 317 318 379 380 381 382 383 384 385 386 301 380 309 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419
004b voooooo58
0048 VOOOOOO7F 004c VOOOOOO4F 0040 VOOOOOObb OOQE v00000010 004F V0000003A 0050 V0000004E 005 1 VOOOOOO5F 0052 V0000007E 0053 voooooo4c 0054 VOOOOOOb5 0055 voooooo 10 0056 V00000026 0057 V000000c3 0058 VOOOOOOBb 0059 voooooo58
STR B LR B SR A TEST SK3 GE LR A EXR B STR A LR F SR G TEST SK3 CT LA1 L3/f5 AI L3 B
* OOSA VOOOOOOlA TSL 0058 VOOOOOOZE SK3 EP 005c VOOOOOOcO LAI L4ff5 0050 VOOOOOOBO Al L4 005E voooooo58 B
* 005F VOOOOOOcO LA1 SLVff5 OObC VOOOOOOA9 AI SLV 0061 vooooooo4 FORK 0062 VOOOOOOlA TSL 0063 VOOOOOO2E SK3 EP 0064 vooooooc3 LAI L5ffS 0065 VOOOOOOAC AI L5 0066 voooooo58 8
l
0067 v000000c1 LI 1 0068 VOOOOOO7A STR S 0069 vooooooco LA1 L4ff5 OObA V00000000 AI L4 0060 voooooo58 B
OObC 0060 OObE OObF 0070 0071 0072 0073 0074 co75
0076 VOOOOOOlA oc77 V0000002E 0078 vooooooc4 0079 VOOOOOOAF
V0000004A VOOOOOOOA V00000048 v00000011 VOOOOOOOA V0000004c VOOOOOOOb vooooooco V000000i30 VOOOOOOSB
l
Lll
* * L5
* t L3
B
LR S OSL LR I INCA OSL LR F OSL LA1 L4ff5 Al L4 B
TSL SK3 EO LA1 Lb/f5 AI Lb
MEMORY DATUM
364. 365. 366. 36-f.
B-A:0 368. 369. 370.
EXCHANGE IF A>B 371. 372. 373. 374.
F-G:0 315. 376. 377. 370. 379.
G=F; GO TO FINISH 380. 381.
GuF: IS A SLAVE ALLOCATED? 382. 383.
YES, GO OIRECTLV TO OUTPUT 384. 385. 306. 307.
NO, TRY TO GET A SLAVE 388. 389. 390.
OiO I GET ONE? 391. 392.
YES, GO SEND lNllIALIZATION TO SLAVE 393. 394. 395. 396.
NO, SET S TO 1, INDICATES STORING HAS BEGUN 397. 398.
GO 00 OUTPUT
INITIALIZATION FOR SLAVE S BECOMES SLAVE’S R
I+1 BECOMES SLAVE’S I
F 8ECOMES SLAVE’S F
GO 00 OUTPUT
FINISH: SEE IF I AM THE BOTTOM PROCESSOR
NO: GO FINISH
399. 400. 401. 402. 403. 404. 405. 406. 407. 408. 409. 410. 411. 412. 413. 414. 415. 416. 417. 410. 419. 2lodAlv
FIG. B-2 (continued).
-123-
EXAMPLE 1 SIFTING SORT I14 APR 721 CISCO ASSEMBLER
420 421 422 423 424 425 426 427 420 429 430 431 432 433 434 435 436 437 43B 439 440 441 442 443 444 445 446 447 440 449 450 451 452 453 454 455 45h 457 450 459 460 461 462 463 464 465 466 467 468 469 470 411 472 473
CO7A voooooo5B
0070 007c 0070 007E 007F 0080 0081 0082
voooooo4rJ LR G VOOOOOOOE MEMS VOOOOOOCF LR 0 voooooooB OMEN VOOOOOO48 LR I VOOOOOOOE HEMS V0000004E LR A voooooooB OMEH voooooo53 INCR I voooooo4B LR I VOOOOOOb4 SR F v00000010 TEST VOOOOO029 SK2 LT vooooooo6 OET voooooooo STOP
* B 420.
0083 OCB4 0085 0086 0087 OCR8 OOB9
* 008A VOOOOOOc 1 LI 1 OORB voooooc79 STR R 008C vooooooco LA1 Llff5 0080 VOOOOOOAF AI Cl OOBE VOOOOOO5B B
421. 422. YES: PUT AWAY B 423. 424. 425. 4?6. 427. 420. 429. 430. 431. 432. 433. 434. 435. 436. 437. 438. 439. 440. 441. 442. 443. 444. 445. 446. 447. 44R. 449. 450. 451. 452. 453. 454. 455. 456. 457.’ 458. 459. 460. 461. 462. 463. 464. 465. 466. 46-I. 468. 469. 470. 471. 472.
STORE RESULT AT I
SEE IF BORE IS TO BE DONE
I :F
NC MORE, PUIT
MORE; SET R TO CAUSE FETCH
GO START AGAIN
* * Lb OOBF VOOOOOO4A
0090 voooooo 10 0091 V00000036 0092 vooooooc4 0093 VOOOOOOBA 0094 voooooo5B
0095 voooooo2c 0096 v00000001 0097 VOOOOOO4F CO98 VOOOOOOOA 0099 voooooo37 009A v00000040 009B VOOOOOOOE 009c VOOOOOO4F 0090 v00000006 009E VOOOOOO4B 009F VOOOOOOOE OOAO V0000004E OOAL voooooooB OOA2 VOOOOOO06 OOA3 voooooooo
LR S FINISH
S>O, GO STORE IN MEMORY
TEST SK3 LE LA1 L12ff5 AI L12 B
* SK1 EP WIN ;‘;, WAIT FOR INPUT FROM SLAVE
= : SEND B TO SLAVE
I ALWAYS SKIPS) S-0; STORE B AT G
LR B OSL SK4 LE LR G L12 HEMS LR B OMEH LR I STORE A AT I HEMS LR A OMEM DET STOP THAT’S ALL
* * *
tRUN TEXT XRUN XSNAP TEXT ISNAP %STOP TEXT tSTOP
FIG. B-2 (continued).
-124-
196 197 19B 199 205 LO1 202 2b3 204 235 ZUb 2u7 2U8 2C9 210 211 212 213 214 215 216 217 210 2LY 22Ll 221 222 0014 223 0015 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 24C 241 242 243 244 245 246 247 246 243 250 251
0030 0001 0002 0003 0094 ocl35 0006 cao7 0000 0009 OljOA 0000 OOOC OGJO OOOE OGOF 0010 3011 0012 0013
OOLb 0017 0018
0040
0080
3030 0001 3932 OJJ3 3034 0005 0096 OOb7 ooofl
* * * *
*HEnO =REWORY VOOOOOOb3 VOOOOOO5F voooooo5B voooooo57 voooooo53 VOOOOOO4F VO300004B v0000c347 v30000043 VOOOOOO3F voooooo38 voooooo37 voooooo33 VOOOOOO2F VO’lOOOO2B VOODOO027 VO3OCOO23 VOOOOOOLF VOuOOOOlB vo3ocoo17 v00000013 VOOOOOOOF VO3OOOVOB voooJooc7 vooooooo3 =MEHORV 5 64 VFFFFFFFF =HEMORV S 126 v00000130 *MEML =FIERORY VOOOOOOb2 V0000005E VOOOOOC5A VOOOOOO56 VOGOOOO52 V0000004E V0300004A VO0050346 VO~~OOOOCZ VOOOJ003E V0000003A vooococ36 VO0000032
0009 OOOA OOJa oooc oouv VOOOOOO2E OOOE VOOOOCO2A OOGF VOOOOOO26 0010 vooooo:)22
EXAMPLE 2: SHELL SORT (12 APR 721 C ISCO ASSEMBLER
INITIALIZATLON OF MAIh MEMORY
196. 197. 198.
PROC MEMO REG MEMORY DC 99 DC 95 DC 91 DC 8-t DC 83 DC 79 DC 75 DC 71 DC 67 oc 63 DC 59 DC 55 DC 51 DC 47 DC 43 DC 39 DC 35 DC 31 DC 27 DC 23 DC 19 DC 15 DC 11 DC 7 DC 3 ORG OlOOff2
199. 200. 201. 202. 203. 204. 205. 2ct. 2C7. 208. 209. 21c. 211. 212. 213. 214. 215. 216. 217. 218. 219. 220. 221. 222. 223. 224. 225. 226. 227.
DC -1 DESCRIPTOR: ORIGIN OF TABLE 226. ORG C2OOff 2 229.
DC 0100 FIRST STACK ENTRY: POINTER TO DESCRIPTOR 2tO. PROC HEM1 231. REG MEMORY 232. UC 98 233. DC 94 234. DC 90 235. DC 86 236. UC a2 237. UC 70 238. UC 74 239. DC 70 240. DC 66 241. UC 62 242. DC 58 243. DC 54 244. DC 50 245. DC 46 246. DC 42 241. DC 38 248. DC 34 249.
FIG. B-3. Example 2 - assembler listing.
-125-
EXAWLE 2: SHELL SORT 112 APR 721 C ISCO ASSEMBLER
252 253 254 255 256 257 25a 259 2bJ 261 262 21~3 204 ib5 266 261 268 269 27U 271 272 273 274 275 276 277 270 279 289 281 282 263 La4 2a5 286 201 288 2a9 290 291 292 293 294 295 296 297 298 299 300 3cJl 3G2 303 3ti4 305 3L;b 307
0011 OOli! 0013 0014 0015 0016 0017 0018
VOOOOOOlE VOODOOO1A VOOOOOOl6 v00000012 V0000003E VOOOOOOOA vooooooo6 vooooooo2 =HEMORY 5 64 VOOOOOO64 *HEM2 =HEHORY V00000Ubl V0300005D v03030059 voooooos5 voooooo51 V0000004D v03000049 v03000045 voooooo41 V30000030 voooooo39 voooooo35 v130000031 v00000020 VOOOOOO29 VO7000025 v00000021 voooooolD vooooocl19 v00000015 v00000011 VOOOOOOOD v00000009 vooooooo5 v00000001 *MEH3 =HEHORY VOi)OOOObO v11)00005c voooooo58 v00000054 v00000050 v0700004c voooooo48 v00000044 v00000040
DC DC DC
30 26 22
250. 251. 252. 253. 254. 255. 256. 257. 258.
DC DC
18 14
DC 10 b 2
ClOO//Z
DC DC ORG
100 HEM2 MEMORY
0040 DC PROC REG DC DC DC DC DC DC
DESCRIPTOR: LENGTH Oii TABLE 259. 260. 261. 262. 0000
5031 0002 0003 OUO4 0005 0006 0007 0008 ooc 9 OUOA 0030
263. 264. 265. 266. 267. 26.3. 2t9. 27C. 271. 272.
DC DC DC DC DC DC
73 69 65 61 57 53 49 45 41 37 33 29 25 21
273. 274. 275. 276. 277. 210. 279. 280. 281. 282. 283. 284.
OGOC OrJO OGOE OOOF 001a 0011 0012 0013 DU14 0015 0016 OG17 OOltl
DC DC DC DC DC DC DC DC
E 17 13
9 5
DC DC 205.
2.96. 207. 288. 289. 290. 291. 292. 293. 294. 295. 296. 297. 298. 299. 300. 331. 302. 303. 304.
OC 1 PROC MEH3 REG MEMORY DC 96 DC 92 DC 88 DC 04 DC 80 DC 76 DC 72 DC 68 DC 64 DC 60 DC 56
0000 000 1 oocl2 0003 0004 0005 OOJ6 0007 ootitl 0009 voooooo3c OGJA voooooo38 oool3 voooooo34 oooc voooooo 30 OOOD vo 100002c OODE v00000020 003F VOOOOOO24
DC 52 DC 40 DC 44 DC 40 DC 36
FIG. B-3 (continued).
-126-
308 3u9 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 947 348 349 350 351 322 353 354 355 356 357 358 J5? jbJ 301 362 3b3
OGlO 0011 0012 0013 0014 0015 OGlb 0017 OOlfl
EXAMPLE 2: SHELL SORT (12 APR 721 C tSC0 ASSEMBLER
v00000020 v0000001c vooooool8 voooooo14 v00000010 voooooooc vooooocoe vooooooo4 voooooooo
DC DC DC oc
2 DC DC DC
* l
32 26 24 20 lb I2
a 4 0
* * LAYOUi OF STACK AND DEXRIPTOR: * + STK -----> D -----> BASE -----> . . . * P N TABLEIlJ * R TABLEl2J * .? . * TN . * TABLEIN) * * * REGISTER ASSIGNMENTS 1 1 VAL 1 INCREMENT VALUE STK VAL 2 SIACK POINTCR El VA1 3 =BASE+R =JlTABLEIRJ J K VAL 4 OUTER LOOP COUNTER L VAL 5 INNER LOOP COUNTER X VAL 6 OPERAND 1 Y VAL 7 OPERAND 2 * * * OFFSETS FROM STACK PCINTER * P VAL 1 ‘PASS NUMBER*; CONTAINS If2Jr FIRST INCR R VAL 2 PROCESSOR NllMBER, =OFFSET IN TABLE 2 VAL 3 SAVED ITEM FROM TABLFtRJ TN VAL 4 7lTABLElNII STKDEP VAL 5 DEPTH :lF STACK PER PROCESSOR l
* * * * FORWARD KEFERENCES * COUNT VAL OF SLV VAL c35 Ll VAL 037 L2 VAL 052 L3 VAL C84 St3 vAL OF0 51 VAL 03E 53 VAL 082 55 VAL cc4 57 VAL CEO
3c5. 306. 30). 3C8. 3c9. 310. 311. 312. 313. 314. 315. 316. 317. 318. 319. 320. 321. 322. 323. 324. 325. 326. 327. 328. 329. 330. 331. 332. 333. 334. 335. 336. 337. 338. 339. 340. 341. 342. 343. 344. 345. 346. 347. 348. 349. 350. 351. 352. 353. 354. 355. 356. 357. 358. 359. 360.
FIG. B-3 (continued).
-127-
EXAMPLE 2: SHELL SORT 112 APR 721 CISCD ASSEHBLER
364 3b5 Jtdo 307 368 369 J7U J71 372 373 OUdCl 374 OOUl 375 GlJO2 37b OGJ3 377 olJci4 37d 0005 379 OGC6 3Bti 0037 3dl Oi)Sd 3t.2 0009 38J UUOA 334 oous 385 ococ 3a6 0030 387 OOCE ;1bB OUOF 3e9 0013 390 0011 391 0212 392 0~113 393 CQ14 394 0015 395 OOlb 396 0017 397 3Y8 JcJlB 399 OGl9 400 OOlA 4Dl OGlt) 4iri ooic 4i)3 0010 4G4 OOli 405 OClF 4Lb oil20 4ti7 0021 4U8 0022 4c9 OG23
* * *
*MICROMEt =STORE VOOilUOODO VOOOOOOAO VOSOOO37A VO’)OOGC4A V00000000 v00000001 v000c0011 vo’lm-!ouoo vooooooco v30000019 vo3oooco1 v00000ir7c v00000390 vooooou9o VOOOOC07E V000~054E COUNT v03000c90 VOOOOCD7E v0300c010 VCQOr)OOZF voooo@c5l VOGOG03CG VO’)OI;OOAF vo’)oooJ5B
* VOJ0003Cl voooDDoa3 VD3OCO 579 vooocoocl VOOOOOO42 V3000000E v’)ooCo349 v3oooonoB vooooooc2 VOOOOOC42 VODOOCOOE VO J@OOflCY VOOOOOOOB VOOOOOO4A VO700003D v000000cl1 VOOOOO07D VOOOOOoOD v00000cl01 V0000007tl vooooooc4 VO3000042
4i3 0024 411 0025 412 0026 413 GO27 414 OcJZiJ 415 0029 416 OOZA 417 OOZD 418 oczc 419 03LLl
56 VAL CES 54 VAL Cl02 ENDTE ST VAL OllC CYCLE VAL 013A
PRUC HlCPCHEH tlEG STORE LA1 0200//S AI 0200 STR STK LK STK HtHF WIN INCA MEMF LI 0 STR I WIN STR K SHR LN,lll SHR LN.111 STR X LR X SHR LN. Ill STR X TEST SK4 EQ INCR I LA1 COUNT//S AI COUNT B
LI 1 I I=SHIFT COUNTI SHL LN.(Rl) SHlFr LEFT BY COUNT STH I I l2)=2**lLCG h)-2 Ll P AR STK YEMS LR I ofltbl LI R AU STK MEMS LP. 1 OWEM LR STK HEMF WIN STR L MEMF WIN STR 13 LI TN AR STK
INITIALIZE STACK POINTER CCHPUTE 1121
FETCH N INITIALIZE COUNT TO 0
SAVE N IN K N/2 N/4
HOLD IN X
SHIFT UNTIL = 0
CCUNT ONE WHEN -=O GO TO SHIFT LCOP
STORE I I21 AT P
STORE RIMAXI AT R
RfMAXI=IlZI CCHPUTE alTABLE(Nll
GET BASE
361. 362. 363. 364. 3.55. 366. 367. 368. 369. 370. 371. 372. 373. 374. 375. 316. 377. 378. 379. 3RC. 381. 302. 3e3. 384. 385. 386. 397. 3ea. 389. 390. 391. 392. 393. 394. 355. 356. 397. 358. 399. 4co. 401. 402. 403. 404. 4c5. 406. 4c7. 4ca. 409. 41c. 411. 412. 413. 414. 415. 416.
FIG. B-3 (continued).
-128-
EXAMPLE 2: SHELL SORT 112 APR 721 CISCO ASSEMBLER
423 421 422 423 424 425 426 421 428 429 430 431 432
OOZE OOZF 0030 033 1 0332 0033 0034
VOOOOC30E voooooo4c voooooo43 v00000008 v000000c1 voOooooB7 voooOoo58
l l
vo!looooo1 SLV VOOOOOOTA v000000c1 Ll V30000042 voooooooo vooooooo1 voooooo79 VOOOOOOc2 voooooo42 V00000000 v00000001
MFMS LR K AR 0 OMEH LAI L1//5 AI Ll R
411. 41e. 419.
STORE AT TN 420. GO TO NEY PASS 421.
422. 423. 424. 425. 426. 427.
SLAVE STARTS HERE GET NEW STACK POINTER FROM MASTER INITIALIZE I TO r(2) FOR NEW PROCESSOR 428.
429. 43c. 431. 432.
CCHPUTE ZIlTABLEiRll 433. 434. 435. 436. 437. 438. 439. 440. 441. 442. 443. 444. 445. 446. 447. 448. 449. 450. 451. 452. 453. 454. 455. 456. 457. 451. 459. 460.
E=BASE+R=a(TABLEIRI 1 PUT MARKER IN TAELEIRI
LARGEST NEC NUMBER PLACE IT AS A MARKER
TABLE(R) -> Y STORE AT 2 IN STACK
GET R
0035 3036 503 7 0038 0039 003A 0038
WIN STR STK LI P AR STK HEHF WtN STR I LI R AR STK MEMF WIN STR B LR STK flEHF WIN MEHF WIN AR B STR 8 HEMH
433 434 435 436 003c 437 0030 438 003E 439 003F 440 ODCS v0I)000070 441 ou41 VOOOOOO4A 442 0042 443 0043 444 0044 445 0945 446 Ji46 447 uo47 448 0048 449 0049 450 J04A 451 0048
voooooooo v00000001 VOOOOOOOD vocJooooo1 v00050043 voooooo75 VO9OOOOOF vcl30000c1 v33000098 vooouoooB VOrJOOOOC3 VO7000042 V0000000E VOJOGOOOl VOOOOOO7F voooooo’Jtr
* v000000c2 L2 VOOOC’OO42 v03000000 v33000031 v0lJ000012 V00000070 v00000010 VO!lOOOO26 vooooooc4 VOOOOOOA4 voooooo58
* vouooooc1 vooO@onB5 v000110004 V03000OlA VOOOO003E
R:l
R=lt NO MORF PROCESSORS REP’0
LI 1 SHR CN.11) OMEM
452 004c 453 OOCD
LI 2 AR STK MEMS WIN STR Y OMEM
454 455
OO4E Ob4F 0050 0051
OG52 0053 0054 G355 0056 0057 0056 0059 0051 005t) 005c
005D DOSE OOSF 0060 0061
455 457 45li 439 LI R
AR STK MEt4F
460 461 462 4b3
WIN DECA STR L TEST SK3 GT LA1 L3f 15 AI L3 8
LA1 SLV//S AI SLV FORK TSL SK3 hE
464 465 4bb 4bl 4b8 469 470
461. 462. 463. 464. 465. 466. 46-I.
471 4 72 473
R>l; TRY TO GET A SLAVE
GET ONE?
460. 469. 470. 471. 472.
,IC&lY
474 475
FIG. B-3 (continued).
;129-
EXAMPLE 2: SMELL SORT (12 APR 72) C ISCO ASSEMBLER
476 471 478 479 400 481 482 483 484 485 486 467 480 489 490 491 492 493 494 495 496 4Y7 490 499 500 501 502 503 504 505 5li6 501 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 52b 527 528 S29 530 531
0062 0063 0064
0065 0066 006 7 0068 0069 006A 0068 OObC OObD OObE 006F 0070 0071 0072 0073 0074 out5 0076 0077 0078 0019 007A 0078 007c 0070 007E 007F 0080 0081 0082 0083
OC84 OJ85 0086 0087 OOBE 0089 OOBA OOBB 008C ooao 008E 008F 0090
0091 0092 0093 0094 0095 0036
vooooooc4 VOOOOOOA4 voooooo58
l
VO000004A voooooooo vooooooc5 VOOOOO042 VOOOOOOOE v00000001 voooooooB v000000c1 VOJOOOO42 VOOOOOOOD VOOOOOOC6 VOOOOO042 VOOOOOOOE v00000001 v00000000 vooooooc7 VOOOOO042 VOOOOOOOE V0’)000040 voooooooB V000000c4 VOOOOO042 voooooooo V000000c9 VOOOOO042 VOOOOOOOE v00000001 v00000000 vooooooc5 VOOOOO042 VOOOOOOOA
l
v00000040 L3 voooooo41 voooooo7c v000000c1 VOOOOO042 V00000000 v00000001 VOOOOO061 v00000010 V0000003E vooooooc4 VOOOO&&E voooooo5.3
* Vo000004c s2 voooooooo- v00000001 V0000007E VOo0000cl voooooo98
LA1 L3//5 Al L3 B
NO; FORGET IT FOR NOW
LR STK YES; GENERATE PARAMETERS NEMF LI STKOEP AR STK iEMS WIN OMEf4 LI P AR STK HEMF Ll P+STKDEP AR STK HEMS WIN OMEM LX R+STKOEP
EMS STK
LR L OMEM LI TN AR STK ME MF LI STKOEP+Th AR STK HEMS WIN OMEM LI STKOEP AR STK OSL
LR 0 AR I STR i Ll P AR STK HEHF WIN SR I TEST SK3 NE LA1 51//s Al Sl 0
LR K HEHF WIN STR X LI 1 SHR CN.11)
SLAVE’S STACK POINTER I=a(O)l
COPY 0
473. 474. 475. 416. 477. 470. 479. 480. 481. 402. 403.
cam P
484. 485. 486. 407. 488. 489. 490. 491.
STORE ‘R-l’ AT SLAVE’S R 492. 493.
(R-1, FROM ABOVE) 494. 495. 496.
COPY TN FOR SLAVE
SEND NEY STACK POINTER TO SLAVE
INITIALIZE K=il~TABLElR+I~J GET I(21
497. 498. 499. 500. 501. 502. 503. 504. 505. 506. 507. 508. 509. 510. 511. 512. 513. 514. 515.
1121:1 516. 517. 518.
t=IlZIi NO DATA DEPENDENCY. GO SORT 519. 520. 521. 522.
CHECK IF SORTIP-l.R*I) IS DONE 523. LOOK AT TABLE(R+Il 524.
525. 526.
GENERATE MARKER 521. 52B.
FIG. B-3 (continued).
-130-
EXAMPLE 2: SHELL SORT (12 APR 721 C ISCO ASSEMBLER
532 533 554 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558
0097 0098 0099 OiJ9A 009a 009c
0090 OOYE OOYF OOAO OOAl
voooooo66 v00000010 VOD00003E VOOOOOoc4 VOOOOOORl voooooo58
* VOOOOOO3F voooooo4c 51 voooooooo v00000001 VOOOOOOlE
*
SR X TEST SK3 NE LA1 52//S Al s2 e
SK4 NE LR K MEHF WIN STR X
2 IS TEMPORARILY STORED LR X SR Y TEST SK3 LT LA1 53//s Al 53 B
COMPARE TABLE(R+ll YITH MARKER
LOOP UNTIL MARKER IS GONE
525. 530. 531. 532. 533. 534. 535. 536. 537. 538. 539. 540. 541. 542. 543. 544. 545. 546. 547. 540. 549. 550. 551. 552. 553. 554. 555. 556. 557. 558.
(ALWAYS SKIPS) GET TABLE(R+I)
IN Y AS WELL AS IN STACK
X:2
GO TO OUTER LOOP
OOA2 V0000004E 0013 VOOOOO067 OOA4 OOAS
voooooo 10 VOOOOOOZA voooooocs voooooo82 voooooo58
* vooooooc3 VOOOOO042 VOOOOOOOE V0000004E VOOOOOOOtl voooooo4c VOOOOOOOE VOOOOOO4F VOOOOOOOR voooooo49 53 vrJooooo44 V0300007c
OOA6 OOA7 OOAM
OOAY OGAA COAL3 OOAC OOAO OOAE OOAF OOBU
LI 2 AR STK MEMS LR X OHEM LR K
INTERCHANGE x AND 2 STORE X AT 2 IN STACK
2(TABLEIR*II)
STORE 2 AT TABLEtR+Il
K<-K* I
559 560
MEMS LR Y
b61 OOBL 5~2 OGB2
OHEH LR I 559.
560. 563 564 565 566 567 560 5b9 570 571 572 513 514 515 576 577 578 519 580 501 582 %33 584 565 586 507
0083 ooil4
AR K STR Ll fN AR STK MEMF WIN SR K TEST SK3 GE LA1 54//s AI 54 B
LR K MEMF STR L WIN STR X LR L SR I STR L SR B
561. 562. 563. 5t4. 565. 566. 561. 560. 569. 570. 571. 572. 513. 574. 575. 516. 577. 578. 579. 580. 581. 582. 583. ia4. 1lCWll
OOd5 vooooooc4 0086 v00000047 0087 0008
voooooooD v00000001
0089 VOOOOO064 OOaA v~30000010 ii(TABLEtN)):K
‘OONC WITH INSERTION SORl
lTASLEtR+V+Il I
OOdB OOBC OOBO OOBE
OOBF ooco ooc 1 OOC2 ooc3 ooc4 ooc5 OOC6 OOC? ooce OOCY OOCA
VfJ000003A voooooocB VOOOOOOAZ voooooo58
* voooooo4c voooooooo v00000070 vooooooo 1 V0000007E V00000040 55 VOOOOO061 V00000070 VOOOOO063 VOOOOO061 v00000010 V0000003A
SR 1 TEST SK3 GE
L:R+I
FIG. B-3 (continued).
7131,
588 OOCB 589 uocc 590 OOCD 591 592 OOCE 593 OOCF 594 0000 595 DOD1 596 0002 597 0003 59tl 0004 599 0005 600 0006 601 0007 602 OGJ8 603 0009 604 OCOA MS 0006 606 oooc 607 0000 608 O(iOE 609 OOSF 610 611 ODE0 612 ODE1 613 OOE2 614 00E3 615 OOEI 616 617 616 ODE5 619 OOEb 620 OOE? 621 GOEB 622 OOEY 623 OOEA b24 OOEB 625 OOEC 626 OOEO 621 OOEE 628 03EF 629 OOFO 630 OOFl 631 OOF2 632 b33 OOF3 b34 OOF4 635 OOFS 636 OOFb 631 COF? 638 OOFLI 639 OOFP 640 OGFA 641 OOFB 642 OOFC 643
EXAMPLE 2: SHELL SORT I12 APR 721 CISCO ASSEMBLER
vooooooc7 VOOOOOOAS voooooo58
V00000040 V00000000 v00000001 VOOOOOO7F voooooo66 v00000010 V00000G40 voooooo41 VOOOOOOOE VOODOO026 vooooooc7 VOOOOOOAO voooooo58 VOOOOOO4F voooooooB VOOOOOOCb VOOOOOOA4 voooooo58
VO000004E VOOOOOOOB vooooooc5 vooooooB2 VOGOOO058
v’ooooooc3
v00000010
VOOOOO042 V00000000
voooooo48
v00000001 VOOOOOO7F voooooo66
v00000041 VOOOOOOOE VOOOOO026 v000000c1 vooooooao voooooo58
VOOOOOO4F voooooooB V000000c3 VOOO@OO42 VOOOGOOOE V3000004E VOOOOOOOB vooooorJc5 voooooOB2 voooooo58
*
* 57
* * 56
*
*
LA1 S6//5 AI 56 8
LR L ME MF WIN STR Y
::ST X
LR L AR 1 UEMS SK3 GT LA1 57//s Al 57 B LR Y OMEH LA1 55//s Al 55 El
LR X OHEM LA I 53//s AI 53 B
LI
TEST
2 AR
LR
STK MEMF
B
WIN STR Y SR X
AR I HEMS SK3 GT LA1 58//S Ai SE I3
LR Y OMEM Ll 2 AR STK MEMS LR X OMEM LA1 53//s Al 53 B
L<R+I: GO 00 LAST ONE
L=R
L>=R+I: 00 NEXT
y:x
X>=Y; INSERT HERE
X<Y; RIPPLE Y
GO TO INNER LOOP
INSERT X AT TABLElL*I I
GO TO OUTER LOOP
LAST COMPARISON: USE 2
2:x
STORE 2 AT TABLEIR+Il
SlURE X AT 2
GO TO OUTER LOOP
585. 586. 58f. 588. 589. 590. 591. 592. 593. 594. 595. 596. 597. 598. 599. 600. 601. 602. 603. 604. 605. 606. 601. 6C8. 609. 610. 611. 612. 613. 614. 615. 616. 617. 618. 619. 620. 621. 622. 623. 624. 625. 626. 627. 620. 629. 630. 631. 632. 633. 634. 635. 636. 637. 638. 639. 640.
FIG. B-3 (continued).
-132-
EXAMPLE 2: SHELL SORT (12 APR 721 CISCO ASSEHRLER
b44 * b45 OOFD V030C1004E SB 646 OljF t vc9ocoGoB b47 CCFF v5oor)noc5 648 OlCC vOOo~oo~2 649 Cl131 v300~0058 65J *
LR X OMCH LA1 s//5 AI 53 B
INSERT X AT TABLE(R+Il GO TO OUTER LOOP
651 1 b52 OlJ2 v00000047 54 b53 0123 v00010c90 054 OlU4 voJo~m7Y
LR I WR LN.11) [<-I/Z STR I
655 OIG’, v000000c2 LI R 656 31Ub VOOO60342 AR STK 653. 657 Cl07 VODO3Oil@D MEMF 654. 658 oiue vooosco~i WIN 653 0129 V’)JODGO61 SR I R:I 6f.6 OlUA VO~JOUOO13 TEST bhl OlOtl VOOOOO~36 SK3 LE 6b2 663 064 bt.5 665 6t.7 6bd 6b9 b7C 671 372 673 674 675 076 077 678 679 080 b&l1 682 iJll3 684 bs5 686 b87 bdB 689 693 691 b?.? b93 694 b95 I96 691 698 099
CLDC VcOOGoCca 0130 vo~ococIBc 310E VC00C005A
013F v030000c3 0113 VO3OCOO42 0111 V9000000D 0112 vo!I33coc’3l 0113 VOOOOOO7F 0114 VOCOGO31A 0115 V0000032E 0110 v9300c0c4 0117 v0000031\4 OLlcI voooooo5a
0119 voooPoDc2 OllA v00000232 JllL-5 v01)000ir58
SllC vooooooc3 OllD VO5000042 DllE vcooooooD OllF VO~OOO04B 3120 V0300030E 0121 v0c000001 0122 v03uiJ0000 5123 VOOOoOOc2 0124 VOOOOO042 0125 VOOOOOOOD 0126 voooooool 012 7 v00000012 OlZtl VDOOD007D 0129 vo3ocoo1o DlLA V000D003E 0126 V000000c9 012c VOOOOOOBA 012D vOooooo5B
LA1 Al B
* LI AU MEHF WIN STR TSL SK3 LA1 AI B
l
LA1 AI B
* * ENOTEST LX
AR HEMF LR HEMS WIN OHEM LI AR HEMF WIN DECA STR TEST SK3 LA1 AI B
ENDTEST// KC MORE PASSES FOR THIS R ENOTEST
2 STK
RElNlTIALIZC Y=Z FOR NEXT PASS
Y
EQ L3//5 L3
(MORE PASSES TO GO) SLAVE ATTPCHEO? YES; GO TC NEXT PASS
L2//5 NO; TRY AGAIN TO GET ONE L2
2 STK
6
STORE 2 AT TABLE(R)
R NC MORE PASSES FOR THIS R STK
L HOLD R-l R:l
NE CYCLE//S A=1 ; THAT’S ALL CYCLE
641. 642. 643. 644. 645. 646. 641. 640. 649. 650. 651. 652.
655. 656. 657. 658. 659. 660. 661. 662. 6t3. 664. 665. ht6. 667. 6t0. 669. 670. 611. 672. 673. 674. 675. 676. 617. 618. 679. 660. 681. 612. 683. 684. 685. 6Bb. 607. 680. 689. 690. 691. 692. 693. 694. 695. 656.
FIG. B-3 (continued).
-i33-
70u 7Cl 7”Z lu3 7L4 7d5 7Ub lL7 728 709 71L 711 712 713 714 715 716 717 710 719 I 2 0 721 722
012E 012F J13U 0131
Cl32 11133 ii134 tJ135 0136 UIJ7 013b 0139
013A
EXAHPLE 2: SHELL SORT II2 APR 721 CISCI-I ASSEFlRLEP
* VODO@OOlA v33000020 V03000006 v39000000
* VOOOOO:)c2 VO’JOOOO42 Vo!lOo0DnE V’l’IOC004D v3onoGou0 VO~JOC@L)CI v 3ooooc37 VOCCOGOSA
* *
VOOOOOO30 CYCLE t *
%RUN %SNAP ZSTOP
TSL SK2 EQ DET STOP
Ll R AR STK HEMS LR L OMEM LA1 I.1 //5 Al Ll B
STOP RETURN TO MAIN INTERPRETER
TEXT fRUN TEXT ZSNAP TEXT XSTOP FND
THIS R IS OWE. NEXT BEGUN?
YES; I MAY GO AWAY
NC: THIS PROCESSOR BEGINS AS R-l
(CONTAINS R-l FRCH ABOVE1 STORE R-l AT RI CC BECOME PPCCESSCR R-l
657. 6SO. 699. 700. 7c1. 702. 7c3. 104. 705. 7C6. 707. 7cLI. 709. 71C. 711. 712. 113. 714. 715. 716. 711. 710. 719.
FIG. B-3 (continued).
-1349
tXAMPLE 3: ~AWIX HULTIPLY ill CAY 72)
[Assembler listing: initialization of main memory. Memory modules MEM0 and MEM1 are loaded (ORG 0300 and 0400) with DC constants holding the test matrices of Fig. B-5; matrix C begins at 0500.]
FIG. B-4. Example 3 - assembler listing.
[Assembler listing: initialization of memory modules MEM2 and MEM3 with the remaining matrix data (ORG 0300 and 0400). Register assignments begin:
  STK    VAL 1   stack pointer
  CADR   VAL 2   destination address]
FIG. B-4 (continued).
[Assembler listing; the machine-code columns are not legible. The recoverable symbol assignments and comments are:

Register assignments (continued):
  LEN      length and width of matrices
  INDEX    counter for inner product loop
  I        row number, during startup; PROD, the product, during the inner product
  J        column number, during startup
  A        operand 1, during multiply
  AADR     address of operand 1
  B        operand 2, during multiply
  BADR     address of operand 2

Constants:
  STKBASE  VAL 2   offset of stack base pointer, in stack
  PUSH     VAL 3   length of stack push, per processor

Offsets from STKBASE:
  IBASE    VAL 1   I index value, for startup
  JBASE    VAL 2   J index
  ABASE    VAL 3   A vector address
  BBASE    VAL 4   B vector address
  BASELEN  VAL 5   length of stack base

Layout of stack: an initial block (FLAG, IBASE, JBASE, ABASE, BBASE) at STKBASE, followed by one block per forked processor holding AADR(k), BADR(k), and a pointer back to STKBASE. Initially AADR = ABASE = addr(A(1,1)), BADR = BBASE = addr(B(1,1)), CADR = CBASE = addr(C(1,1)), LEN = length, and STK holds the initial position as above. Processor P1's register 1 (STK) is initialized by DC 0600, the initial stack pointer.]
FIG. B-4 (continued).
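The stack discipline just reconstructed is compact enough to restate in a higher-level notation. The C sketch below is an illustrative rendering of that layout only, not code from the thesis; the one-word field type and the struct names are assumptions drawn from the listing's symbols.

```c
typedef int word;   /* illustrative: stands in for one Z-machine memory word */

/* Base block at STKBASE, shared by all processors working on the
   multiply (BASELEN = 5 words in the reconstruction above). */
struct stack_base {
    word flag;    /* FLAG:  set to 1 when more processors are needed */
    word ibase;   /* IBASE: I (row) index value, for startup         */
    word jbase;   /* JBASE: J (column) index value                   */
    word abase;   /* ABASE: A vector (row) address                   */
    word bbase;   /* BBASE: B vector (column) address                */
};

/* One block pushed per forked processor (PUSH = 3 words). */
struct stack_block {
    word aadr;     /* AADR(k): address of operand 1         */
    word badr;     /* BADR(k): address of operand 2         */
    word stkbase;  /* pointer back to the shared base block */
};
```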
[Assembler listing: startup code. Recoverable comments: initialize I and J to 1; put STKBASE in the first block and move the stack pointer up past it; store I at IBASE and 0 in FLAG; store J+1 at JBASE, AADR at ABASE, and BADR at BBASE; go get a slave; test J:LEN and I:LEN; when I = LEN, go do the last one; store I+1 at IBASE.]
FIG. B-4 (continued).
[Assembler listing: forking a slave for the next row. Recoverable comments: store 0 at FLAG; store 1 at JBASE; store AADR+LEN at ABASE and BADR-LEN+1 at BBASE; try to get a slave; if none, go set FLAG to 1 (more processors needed); if one is got, send the startup data - stack pointer, destination address, length - and go do the inner product.]
FIG. B-4 (continued).
[Assembler listing: the software multiply and inner-product accumulation. Recoverable comments: put the low-order bit of the multiplier into the sign bit; if the removed bit = 1 (test negative), add the multiplicand; shift and test the multiplier; if A was zero after the right shift, the multiply is done; shift the multiplicand, checking whether a bit shifted into the sign (multiplicand overflow) and checking for overflow on the add; no overflow of either kind, go on; if LINK is now 1, the result is to be negative, so fix up the sign by putting LINK into the sign bit; on product overflow set the global flag and skip the sum; otherwise add the product to the sum, setting the flag on sum overflow; increment the counter and test INDEX:LEN.]
FIG. B-4 (continued).
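The comments on this page describe a classical software shift-and-add multiply with overflow detection on both the multiplicand shift and the accumulating product. The C sketch below shows the same scheme in outline; it is an illustration under assumed 32-bit two's-complement words, not a transcription of the Z-machine routine (which folds the sign into the LINK bit instead of a C variable).

```c
#include <stdint.h>

/* Shift-and-add multiply, as in the listing: examine the low-order
   bit of the multiplier, add the multiplicand when it is 1, then
   shift multiplier right and multiplicand left until the multiplier
   is exhausted.  The overflow flag mimics the listing's global-flag
   convention; the word width is an assumption. */
static int32_t shift_add_multiply(int32_t a, int32_t b, int *overflow)
{
    /* Work with magnitudes; the sign is fixed up at the end, the way
       the listing settles the sign via LINK before the loop. */
    int negative = (a < 0) != (b < 0);
    uint32_t multiplier   = (a < 0) ? -(uint32_t)a : (uint32_t)a;
    uint32_t multiplicand = (b < 0) ? -(uint32_t)b : (uint32_t)b;
    uint32_t product = 0;

    *overflow = 0;
    while (multiplier != 0) {
        if (multiplier & 1) {            /* removed bit = 1: add multiplicand */
            if (product > UINT32_MAX - multiplicand)
                *overflow = 1;           /* overflow on the add */
            product += multiplicand;
        }
        multiplier >>= 1;                /* shift and test the multiplier */
        if (multiplier != 0) {
            if (multiplicand & 0x80000000u)
                *overflow = 1;           /* bit shifted into sign: overflow */
            multiplicand <<= 1;          /* shift the multiplicand */
        }
    }
    if (product > 0x7FFFFFFFu)
        *overflow = 1;                   /* product outside signed range */
    return negative ? -(int32_t)product : (int32_t)product;
}
```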
[Assembler listing: advancing to the next pair of operands. Recoverable comments: when INDEX = LEN, that's all, go to NEXT; AADR <- AADR+1 and fetch the next element from AADR; BADR <- BADR+LEN and fetch the next from BADR; if a slave exists, go on; if not, check FLAG: FLAG = 0 means no action required; FLAG != 0 means try again to get a slave and, if one is got, send the startup data beginning with the stack pointer.]
FIG. B-4 (continued).
[Assembler listing: restart of a processor on its next stack block. Recoverable comments: load J from JBASE, AADR from ABASE, and BADR from BBASE; go do the next element (branch to MATRIX). The listing ends with %RUN, %SNAP, and %STOP text directives and END.]
FIG. B-4 (continued).
A=
-176 0 10 97 -169 -68
18 104 -162 -61 0 111
0 -54 32 118 -147 -46
39 0 -140 -39 46 133
-133 17 -171 -134 37 -4
136 162 -1 129 0 -5
B=
-126 -32 54 140 -118 -25
61 147 -111 0 68 154
-104 -10 75 162 0 -3
0 169 -90 3 90 176
-82 68 -92 -15 112 -91
18 9 101 157 65 147
FIG. B-5. Example 3 - matrix data values.
Flowchart of Example 1: Sifting Sort
I contains the initial table address.
F contains the final table address.
R = 1 when fetching from memory, = 0 when receiving data from master.
S = 1 when data are to be stored into memory (due to unavailability of a slave). After a slave has been attached, S = 0 when slave is ready to receive the next item, = -1 otherwise.
[Flowchart graphics not reproducible. Recoverable steps: slave start; A <- input from master; output '0' token to master.]
FIG. B-6. Example 1 - detailed flowchart.
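Reading the register conventions together with the flowchart, each processor in the chain retains one item and passes every other item on to its slave, so that after all items have streamed through, the chain holds the table in order. The C sketch below is a serial rendering of that sifting idea under the assumption that each stage keeps the smallest item it has seen; it is illustrative, not the Z-machine program.

```c
#include <stdio.h>

/* Sequential rendering of the chained sifting sort of Example 1:
   "processor" i holds the smallest item it has seen and passes the
   rest downstream, so after the stream ends it holds the (i+1)-th
   smallest element. */
static void sifting_sort(int table[], int n)
{
    for (int i = 0; i < n; i++) {          /* processor i in the chain  */
        for (int j = i + 1; j < n; j++) {  /* items streamed through it */
            if (table[j] < table[i]) {     /* keep the smaller item ... */
                int pass_on = table[i];    /* ... pass the larger on    */
                table[i] = table[j];
                table[j] = pass_on;
            }
        }
    }
}

int main(void)
{
    int t[] = { 37, -4, 129, 0, -68, 17 };
    sifting_sort(t, 6);
    for (int i = 0; i < 6; i++)
        printf("%d ", t[i]);
    printf("\n");
    return 0;
}
```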
[Flowchart graphics not reproducible. Recoverable steps: last item; token not yet received; store B at MEMORY[...]; output B to slave; S <- -1; slave start; S <- 1; output S to slave (S becomes the slave's R); output I + 1; output F.]
FIG. B-6 (continued).
[Flowchart graphics not reproducible. Recoverable steps: store B at MEMORY[...]; output B to slave; store A at MEMORY[I]; this processor is done; last processor; this processor restarts.]
FIG. B-6 (continued).
Flowchart of Example 2: Shell Sort
Initially (the increment relation on the first line is illegible in this copy):
B = address (TABLE (R))
initial K = address (TABLE (R + I))
initial L = K - I
TN = addr (TABLE (N))
[Flowchart graphics not reproducible; the chart proceeds to a FORK.]
FIG. B-7. Example 2 - detailed flowchart.
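For reference, the passes sketched in this flowchart follow the usual Shell sort pattern: compare items an increment I apart, ripple an out-of-order item back through positions I apart, and shrink I between passes. The serial C sketch below shows that pattern; the halving increment schedule is an assumption (the listing's comment suggests I is roughly halved each pass), and the local names echo the flowchart's X, K, and L rather than thesis code.

```c
/* Serial sketch of the Shell sort of Example 2.  Each pass compares
   TABLE(R) with TABLE(R+I) and ripples an out-of-order item back
   through earlier positions I apart; I shrinks between passes. */
static void shell_sort(int table[], int n)
{
    for (int gap = n / 2; gap > 0; gap /= 2) {   /* increment I (assumed schedule) */
        for (int k = gap; k < n; k++) {          /* K walks the table     */
            int x = table[k];                    /* X <- (K)              */
            int l = k - gap;                     /* L <- K - I            */
            while (l >= 0 && table[l] > x) {     /* ripple back by I      */
                table[l + gap] = table[l];
                l -= gap;
            }
            table[l + gap] = x;                  /* insert X in its place */
        }
    }
}
```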
[Flowchart graphics not reproducible. Recoverable steps: STK <- STK + STKDEP; ... addr(TABLE(R+I)); X <- (K); Y contains TABLE(R).]
FIG. B-7 (continued).
[Flowchart graphics not reproducible. Recoverable steps: K <- K + 1; L <- K; X <- (K); ripple Y; last item; go to next sequential item.]
FIG. B-7 (continued).
[Flowchart graphics not reproducible. Recoverable steps: (B + I) -> Z; Z <- X; (B + 1) -> X; Y <- Z (already have); try to get a slave; become next processor.]
FIG. B-7 (continued).
Flowchart of Example 3: Matrix Multiply
Assume registers STK, LEN, AADR, BADR, CADR are initialized.
Other registers: I = PROD, J = INDEX, A = AADR, B = BADR.
Stack layout:
STK initial -> STKBASE -> FLAG, IBASE, JBASE, ABASE, BBASE
STK (1) -> AADR (1), BADR (1), STKBASE
STK (2) -> AADR (2), BADR (2), STKBASE
[Flowchart graphics not reproducible. Recoverable steps: start: I <- 1, J <- 1, FLAG <- 0; one branch sets ABASE <- AADR + LEN, BBASE <- BADR - LEN + 1, FLAG <- 0, IBASE <- I + 1, JBASE <- 1; the other sets ABASE <- AADR, BBASE <- BADR + 1, FLAG <- 0, IBASE <- I, JBASE <- J + 1; FORK; if no slave, FLAG <- 1; output STK + 3, CADR + 1, LEN.]
FIG. B-8. Example 3 - detailed flowchart.
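The flowchart's unit of work is one inner product: a forked processor takes AADR (the start of a row of A) and BADR (the start of a column of B) from its stack block, clears (CADR), and accumulates A*B terms, stepping AADR by 1 and BADR by LEN. The C sketch below shows that decomposition serially; the two loops stand in for the FORK chain, and the names are taken from the flowchart rather than from runnable thesis code.

```c
/* One unit of work in Example 3: accumulate a single element of C
   from a row of A and a column of B.  A's pointer steps by 1 and
   B's by LEN, matching the flowchart's AADR+1 / BADR+LEN updates.
   Matrices are LEN x LEN, stored row-major. */
static void inner_product(const int *a_row, const int *b_col,
                          int *c_elem, int len)
{
    int sum = 0;                            /* (CADR) <- 0 */
    for (int k = 0; k < len; k++)
        sum += a_row[k] * b_col[k * len];   /* A steps by 1, B by LEN */
    *c_elem = sum;
}

static void matrix_multiply(const int *a, const int *b, int *c, int len)
{
    /* The startup code forks one processor per element; here the two
       loops stand in for the FORK chain. */
    for (int i = 0; i < len; i++)
        for (int j = 0; j < len; j++)
            inner_product(&a[i * len], &b[j], &c[i * len + j], len);
}
```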
[Flowchart graphics not reproducible. Recoverable steps: (STK) -> AADR, (STK + 1) -> BADR; (CADR) <- 0; A <- (AADR), B <- (BADR); output STK + 3, CADR + 1, LEN; multiply and sum: (CADR) <- (CADR) + A * B; INDEX <- INDEX + 1; FORK; test FLAG = 0 or != 0; slave?]
FIG. B-8 (continued).
APPENDIX C
List of Z-Machine Operation Codes
Control group

(hex) mnem  description
00    STOP  deallocate self and wait for startup
01    WIN   wait for input
02    FORK  request allocation of slave
03    OSLF  output data to slave, with flag
04    DET   detach self from master (if any) and slave (if any)
05    OMAF  output data to master, with flag
06    OMA   output data to master
07    OSL   output data to slave
08    OMEM  output data to memory
09    MEMR  start memory cycle - destructive ("half") read
0A    MEMF  start memory cycle - full read ("fetch") and rewrite
0B    MEMS  start memory cycle - store
0C    MEMM  start memory cycle - read and lock ("read-modify-write")

Miscellaneous group

(hex) mnem  description
10    TEST  set condition code according to contents of A
11    INCA  add 1 to A
12    DECA  subtract 1 from A
13    (mnemonic illegible)  complement A
14    ALNK  add LINK to A
15    LNK1  set LINK to 1
16    LNK0  set LINK to 0
17    (mnemonic illegible)  invert LINK
18    AOV   add OVERFLOW to A
19    TMA   test if MASTER = 0 (i.e., executing at top level)
1A    TSL   test if SLAVE = 0 (i.e., no slave allocated)
1B    TIF   test for flag on last input
1C    TIN   test input source ("0" = MEMORY, "< 0" = SLAVE, "> 0" = MASTER)
(entries 1D-1F are not legible in this copy)

Skip group

(lower 2 bits of instruction = n - 1; assembler version: SKm XX, where XX is a two-character condition specification and m = 1, 2, 3, or 4)

(hex) mnem  description
20    SKOV  skip n on OVERFLOW = 1
24    SKGT  skip n on A > 0
28    SKLT  skip n on A < 0
2C    SKEQ  skip n on A = 0
30    SKNO  skip n on OVERFLOW = 0
34    SKLE  skip n on A <= 0
38    SKGE  skip n on A >= 0
3C    SKNE  skip n on A != 0

Register group

(r = 0, ..., 7; register 0 is the program counter)

(hex) mnem  description
40    AR    add register r to A
48    LR    load A from register r
50    INCR  add 1 to register r
58    EXR   exchange A with register r
60    SR    subtract register r from A
68    NR    AND register r into A
70    OR    OR register r into A
78    STR   store A into register r

Shift group

(lower 2 bits give shift amount: 0 = 1 bit, 1 = 3 bits, 2 = 8 bits, 3 = (register 1))

(hex) mnem   description
80    SHLLN  shift left logical, no link
84    SHLAN  shift left arithmetic, no link
88    SHLCN  shift left circular, no link
8C    SHLCL  shift left circular, link
90    SHRLN  shift right logical, no link
94    SHRAN  shift right arithmetic, no link
98    SHRCN  shift right circular, no link
9C    SHRCL  shift right circular, link

Immediate instructions

(hex) mnem  description
C0    LI i  load immediate (lower 6 bits of instruction are loaded into A)
A0    AI i  append immediate (lower 5 bits of instruction are shifted, left logical no link, into A from the right)
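The field conventions in this table - a 2-bit skip count, a 3-bit register number, and 6- and 5-bit immediates - can be summarized as a decoder. The C sketch below is illustrative only; it assumes an 8-bit operation code laid out exactly as tabulated, which the table implies but does not state outright.

```c
#include <stdint.h>

/* Illustrative field extraction for the opcode layout tabulated
   above.  Everything beyond the table itself (exact field positions,
   the 8-bit width) is an assumption for illustration. */
typedef struct {
    uint8_t op;   /* one Z-machine operation code */
} zinstr;

static int skip_count(zinstr z)   { return (z.op & 0x03) + 1; } /* skip group: n  */
static int reg_number(zinstr z)   { return z.op & 0x07; }       /* register group */
static int li_immediate(zinstr z) { return z.op & 0x3F; }       /* LI: lower 6 bits */
static int ai_immediate(zinstr z) { return z.op & 0x1F; }       /* AI: lower 5 bits */

/* Append-immediate, per the AI description: the 5 immediate bits are
   shifted into A from the right (left logical, no link). */
static uint32_t append_immediate(uint32_t a, zinstr z)
{
    return (a << 5) | (uint32_t)ai_immediate(z);
}
```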
APPENDIX D
Result Data

Example 1 (only the leading columns - processors, execution time, and time ratio - survived legibly; the allocation, parallelism, MER, per cent busy, per cent wait, per cent bus wait, and bus bandwidth columns did not stay aligned in this copy):

  (config illegible):  2 13,423 0.552;  4 7,412 0.305;  8 4,374 0.180;  15 3,094 0.121
  (config a 2.15):     1 24,413 1.000;  2 13,182 0.538;  4 1,256 0.296;  8 4,257 0.114;  15 3,023 0.123
  (config 4 10):       1 28,408 1.000;  2 14,428 0.501;  4 8,856 0.312;  8 6,148 0.216;  15 5,652 0.199
  (config 3 7):        1 21,354 1.000;  (remaining rows illegible)

Example 2 (full rows survived; the column headings did not):

  (config 4 10)  1 38,716 1.000  38,112 21,682  1.00 0.72  68 29  3 58
                 2 26,888 0.695  45,064 27,762  1.68 1.03  58 34  8 86
                 4 26,548 0.685  68,668 21,810  2.59 1.05  38 41 21 88
                 8 26,526 0.685  92,592 21,866  3.49 1.05  28 44 28 88
  (config 1 4)   1 32,855 1.000  32,854 21,662  1.00 0.84  84 16  0 11
                 2 20,403 0.620  33,614 27,162  1.65 1.36  83 17  0 28
                 4 15,543 0.413  33,991 27,810  2.19 1.79  82 18  0 31
                 8 13,986 0.425  31,884 27,866  2.71 1.99  74 23  3 42
  (config 1 2)   1 30,913 1.000  30,913 27,682  1.00 0.89  92  8  0  9
                 2 19,051 0.615  31,349 27,762  1.65 1.46  92  8  0 15
                 3 18,061 0.583  31,154 27,930  1.16 1.55  91  9  0 16
                 4 13,622 0.440  29,160 21,810  2.18 2.04  93  1  0 21
                 8 11,614 0.316  30,416 27,866  2.62 2.40  91  8  0 25

Example 3 (legible columns - processors, execution time, time ratio):

  (config 1 2.125):  1 38,154 1.000;  2 20,736 0.544;  4 10,718 0.281;  8 6,316 0.166;  12 4,927 0.129
  (config 4 10):     1 48,156 1.000;  2 28,043 0.583;  4 11,460 0.363;  7 15,740 0.321
  (config 1 1):      1 46,261 1.000;  2 24,512 0.530;  4 12,626 0.273;  8 1,245 0.157;  11 5,916 0.128

[The scattered remaining columns for Examples 1 and 3 - total time, allocation, parallelism, MER, per cent busy, per cent wait, per cent bus wait, and bus bandwidth - could not be realigned with their rows.]