SLAC-161 UC-32
COMPUTING WITH MULTIPLE MICROPROCESSORS*
JOHN V. LEVY
STANFORD LINEAR ACCELERATOR CENTER
STANFORD UNIVERSITY
Stanford, California 94305
PREPARED FOR THE U. S. ATOMIC ENERGY
COMMISSION UNDER CONTRACT NO. AT(04-3)-515
April 1973
Printed in the United States of America. Available from National Technical Information Service, U. S. Department of Commerce, 5285 Port Royal Road, Springfield, Virginia 22151. Price: $3.00; Microfiche $0.95.
*Ph. D. Dissertation.
KEY WORDS AND PHRASES
asynchronous parallel processor
computer architecture
computer design
emulation
FORK-JOIN
instruction overlap
instruction set
machine language
microprogramming
multiple processor
multiprocessing
parallel processor
sorting
Computing Reviews categories: 4.32, 6.20, 4.21, 4.13
ABSTRACT
Computer systems with multiple processors are becoming more common
for reasons of reliability and modularity. The use of asynchronous processors,
however, leads to problems of complexity of control and of programming. This
work investigates the application of multiple asynchronous processors to the
computing task at the lowest level - that of interpreting single machine-
language instructions.
A particular computer configuration with 15 identical processors has been
constructed using an interpretive simulator. The processors are of relatively
low computing capacity. A common data bus connects the processors with each
other and with the main memory. A restriction on the logical connections
between processors allows each one to communicate with no more than two
others, in a chain-like arrangement.
Three examples - two sort instructions and a matrix multiply - were
coded for this machine and run using the simulator. By varying the bus cycle
time, we found that a common bus with cycle time equal to the processor
cycle time can adequately support up to 15 processors.
The amount of parallelism achieved was significant but showed dependence
on hardware parameters and on the algorithm implementations. Direct simu-
lation of the computer, with an execution trace of the running system, has
yielded some glimpses of how restriction of bus capacity can cause deteriora-
tion of the program execution efficiency and amount of parallelism.
A simple economic model of a multiple processor system is developed and
applied to the three examples. The result shows that the minimum cost per
throughput occurs with four, eleven, and fifteen processors, respectively, for
the three examples when the cost of a processor is one-tenth of the system cost.
ACKNOWLEDGMENTS
I acknowledge with gratitude the persistent guidance of William F. Miller,
Vice-President and Provost of Stanford University, who has also taught me
something about university administration.
The research for this dissertation has been supported over the years by
the able assistance of Harriet Canfield, Kathleen Maddern, Linda Lorenzetti,
Olga Stanners, and Carla West. Mrs. Canfield was particularly helpful at the
crucial time.
Discussions and arguments over many years with my colleague Victor
Lesser and more recently with Maurice Schlumberger have been instructive.
I appreciate the efforts of Professors Edward J. McCluskey, Jr., Edward S.
Davidson, and Forest Baskett in reading and commenting on this thesis and
fondly acknowledge the memory of George E. Forsythe, who should have been
among them and us all.
Finally, credit is due the untiring devotion of WYLBUR and the IBM
System/360 Model 91 at the Stanford Linear Accelerator Center, whose support
never slackened.
TABLE OF CONTENTS
                                                               Page
CHAPTER 1
Introduction and Annotated Bibliography ............ 1
  Introduction ...................... 1
  Programming Context .................. 3
  The Multiple-Processor Approach ............ 8
  Literature Survey and Annotated Bibliography ....... 12
CHAPTER 2
Computer Hardware Description ............... 34
  The Multiple Processor Model .............. 34
  Description of the Z-machine ............... 39
    The Processor ................... 40
    The Microprogram Store ............... 46
    The Main Store ................... 46
    The Bus and Allocator ................ 51
    The Allocator .................... 51
CHAPTER 3
Description of Example Programs .............. 54
  Introduction ...................... 54
  Expected Run Time Variation .............. 55
  Processor Allocation and Deallocation .......... 63
  Example 1: Sifting Sort ................ 65
  Example 2: Shell Sort ................. 67
  Example 3: Matrix Multiply ............... 71
CHAPTER 4
Measurements and Results .................. 76
  Measurements ..................... 76
  Results ........................ 80
    1. Bus and Memory Utilization ........... 80
    2. Bus Capacity .................. 81
    3. Main Memory Speed ............... 87
    4. Amount of Parallelism .............. 88
    5. Run Time versus Number of Processors ...... 96
    6. Cost-Performance Analysis ............ 96
CHAPTER 5
Unexpected Results and Possible Extensions .......... 106
  Results from Execution Trace .............. 106
  Processor-to-bus Buffering ............... 109
  Program "overhead" ................... 112
  Data Bus Design .................... 113
  Hardware Design Alternatives .............. 114
  Software Alternatives and Extensions ........... 114
APPENDIX A - The CISCO Simulation System ........... 117
APPENDIX B - Detailed Flowcharts, Data Values, and
             Assembler Listings ............... 118
APPENDIX C - List of Z-machine Operation Codes ........ 156
APPENDIX D - Result Data ................... 159
LIST OF ILLUSTRATIONS
                                                               Page
1-1   Four levels of computer operation .............. 4
1-2   Programmer's view of processor connections ........ 10
2-1   Z-machine configuration .................. 40
2-2   Z-machine bus bit assignments ............... 48
2-3   Z-machine bus transactions ................. 43
3-1a  Example 1: Variation of problem characteristics with
        number of processors .................. 56
3-1b  Example 1: Variation of problem characteristics with
        number of data elements ................. 57
3-2   Summary of problem characteristics ............. 62
3-3   Shell sort - simplified flowchart ............. 68
3-4   Shell sort - implementation strategy, N=8 ......... 70
3-5   Matrix multiply - simplified flowchart ........... 72
3-6   Stack contents for matrix multiply, N=2 .......... 74
3-7   Register contents for matrix multiply, N=2 ........ 75
4-1   Sample summary sheet .................... 77
4-2   Average utilization of bus and main memory ........ 80
4-3   Example 1: sifting sort - execution efficiency ...... 82
4-4   Example 2: Shell sort - execution efficiency ....... 83
4-5   Example 3: matrix multiply - execution efficiency ..... 83
4-6   Example 1: sifting sort - bus speed dependence ...... 84
4-7   Example 3: matrix multiply - bus and memory
        speed dependence .................... 85
4-8   Example 2: Shell sort - bus and memory speed
        dependence ....................... 86
4-9   Typical parallelism and execution time ........... 88
4-10  Example 1 - parallelism with fast bus and memory ..... 90
4-11  Example 1 - parallelism with slow bus and memory ..... 91
4-12  Example 2 - parallelism with fast bus and memory ..... 92
4-13  Example 2 - parallelism with slow bus and memory ..... 93
4-14  Example 3 - parallelism with medium bus and
        fast memory ...................... 94
4-15  Example 3 - parallelism with slow bus and memory ..... 95
4-16  Example 1 - run time curves ................ 97
4-17  Example 2 - run time curves ................ 98
4-18  Example 3 - run time curves ................ 99
4-19  Constant cost curves .................... 101
4-20  Example 1 - cost curves .................. 102
4-21  Example 3 - cost curves .................. 103
4-22  Example 2 - cost curves .................. 104
4-23  Number of processors for minimum cost/throughput ..... 105
5-1   Sample of execution trace from Example 1 ......... 107
5-2   Sample of execution trace from Example 3 ......... 108
5-3   Sample of execution trace from Example 2 ......... 110
5-4   Comparison of runs with and without bus
        buffer (Example 3) ................... 111
5-5   Microprogram overhead for multiple-processor
        coordination ...................... 112
B-1   Z-machine definitions for assembler ............ 118
B-2   Example 1 - assembler listing ............... 120
B-3   Example 2 - assembler listing ............... 125
B-4   Example 3 - assembler listing ............... 135
B-5   Example 3 - matrix data values ............... 146
B-6   Example 1 - detailed flowchart ............... 147
B-7   Example 2 - detailed flowchart ............... 150
B-8   Example 3 - detailed flowchart ............... 154
"One of my primary objects is to form the tools so the tools themselves shall fashion the work."
Eli Whitney
CHAPTER 1
INTRODUCTION AND ANNOTATED BIBLIOGRAPHY
Introduction
The development of computers has been dominated since its inception by
the concept of a processor as a powerful calculating engine surrounded by
supporting facilities - random-access memory, input and output transmission
media, and peripheral devices. Improvements in circuit speeds and arithmetic
capabilities have merely emphasized the ability of the computer designer to
create far more computational capacity in a central processor than can be
dispensed in any useful way. The latest hardware developments have con-
centrated on using processor-like resources to augment memory, input-output,
and peripheral facilities. Software development has brought about multipro-
gramming - computing alternately on several available tasks by dividing the
memory and input-output resources among them - which has helped to reclaim
a portion of the expensive but unused central processor capacity.
The thesis of this dissertation is that processing (computation) capacity
should be divided and dispensed on demand by using multiple asynchronous
processors of low capacity. We should no longer be intimidated by apparent
problems of control of large numbers of processors. Justification of this
position will be by construction of a plausible multiple-processor computer
system and by demonstrating the running of several example algorithms on it.
The construction has been made in a simulation context, and by varying
some of the design parameters, the limitations imposed by the organization
of this particular computer system have been measured.
In most computing systems built until recently, the central processor
has been of relatively high capacity. Therefore, multiple processor config-
urations have been considered reasonable only in "large systems" - ones
where the computational burden is expected to be high. In order to extend the
validity of the multiple-processor concept to systems of smaller capacity, a
small discrete unit has been chosen. The processor used in the studies here
has the approximate computing power of a typical (under $5,000 today)
“mini-computer”. The choice of such a small processor makes the designer
immediately face the following trade-off: as the unit size of a resource gets
smaller, the significance of the overhead of allocation usually becomes greater
because more units must be allocated for a typical task. The system shown
here has a low time-overhead of processor allocation. In order to guarantee
that this is achievable, certain compromises have been made in the generality
of inter-processor communication.
What kind of parallelism is possible with an asynchronous, multi-
instruction-stream organization? The example algorithms shown here are all
intended to be typical of routines which implement a single (higher-level)
operation. In this sense, the algorithms are emulation routines, micropro-
gram implementations of higher-level language constructs. Whether the
reader chooses to call these constructs "machine-language" operations will
depend on his viewpoint. In any case, the routines are "interpretive" in the
sense that each one performs a single conceptual operation. The fact that
parallelism is demonstrated here only within the interpretive routine (i.e., at
the “micro” level) is not intended to rule out the possibility of further paral-
lelism at higher levels, by use of the more conventional techniques of co-
routines, multiprocessing, and multiprogramming. However, the oper-
ations implemented by these example routines are to be regarded functionally
as primitive operations in which the parallelism is out of the control of the
problem-programmer (the user of the operations). This idea will be foreign
to the software-oriented reader who has not concerned himself with the paral-
lelism and control problems in, say, the arithmetic unit of a processor which
is executing a “multiply” instruction. The processors in this simulated system
are intended to be a resource at the command of the microprogrammer, who
is implementing “machine-language” instructions by writing emulation routines
for them.
The example operations demonstrated here - two sorting methods and a
matrix multiply - are on the other hand rather "high-level" operations
compared to typical machine language instructions. Although the author is a
strong advocate of more complex "machine-language" instructions, these
examples are not necessarily the top candidates for new general-purpose
machine language instructions. The hardware-design oriented reader is
encouraged to accept them as vehicles for demonstrating the modes of multiple
processor interaction possible in this system. The three main examples are
also intended to stress, respectively, the bounds on inter-processor communi-
cation, memory access, and computation resources.
Programming Context
Figure 1-1 shows four of the levels of operation in a computer system
when a job is to be run. A system analyst or user describes the task to be
performed with a flowchart, shown at level 1. A programmer transforms the
[Figure: four panels illustrating the levels - Level 1: problem description (flow chart); Level 2: high-level language (PL/I); Level 3: machine language (IBM 360 assembler); Level 4: microcode (360 Model 40).]
FIG. 1-1. Four levels of computer operation.
flowchart into a set of procedures written in a high-level language. The
sorting portion of this example is shown coded in PL/I. The high-level
language is compiled into machine-language and executed. Shown in the figure,
at level 3, is a portion of the IBM System/360 machine code generated by
compiling the PL/I program shown at level 2. In a processor implemented
using microprograms, some or all of the machine-language instructions in-
voke a sequence of micro steps. At level 4 in the figure is shown a small
portion of a microinstruction sequence invoked in an IBM microprocessor, in
carrying out a machine-language instruction (Husson [52], p. 290).
We pause to observe that "machine language" refers to a machine which
often exists only on paper and in the mind of the machine-language programmer.
We will use the word "architecture" to refer to the structure of this conceptual
machine, as in "IBM System/360 architecture". The term "virtual machine"
is also used to refer to such a conceptual machine, but we avoid using this
term in the machine-language context because it is also used for other concepts.
The underlying microprogrammed hardware which implements a given
machine language may have widely varying structure. We will refer to this
structure as machine "organization". The activity of executing a sequence of
microinstructions which implement a machine-language instruction will be
called “emulation”. Emulation is a particular case of interpretive execution.
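To make this sense of "emulation" concrete, the sketch below (in Python, purely for compactness; all opcode names and data structures here are invented for illustration and are not the Z-machine's actual instruction set, which is listed in Appendix C) shows machine-language instructions being carried out interpretively, each opcode dispatching to a microroutine:

    # Toy emulator: each "machine-language" opcode invokes a microroutine.
    def micro_load(state, operands):
        addr, dest = operands
        state["regs"][dest] = state["memory"][addr]

    def micro_add(state, operands):
        a, b, dest = operands
        state["regs"][dest] = state["regs"][a] + state["regs"][b]

    MICROROUTINES = {"LOAD": micro_load, "ADD": micro_add}

    def emulate(state, program):
        # Interpretive execution: one machine instruction at a time.
        for opcode, operands in program:
            MICROROUTINES[opcode](state, operands)

    state = {"regs": [0] * 4, "memory": [10, 32, 0, 0]}
    emulate(state, [("LOAD", (0, 0)), ("LOAD", (1, 1)), ("ADD", (0, 1, 2))])
    assert state["regs"][2] == 42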
The software support necessary to carry out the activity of Fig. 1-1 is
not shown. A compiler is necessary to transform the program at level 2 to
level 3. A loader and a linkage editor are necessary to put together the por-
tions of a program before execution. A sizable software environment is pre-
sumed when the program is running, to handle error conditions and file access.
At higher levels, executive control of job processing, and file and hardware
maintenance routines, are presumed to exist.
In the work described here, a specific multiple-processor computer
organization is proposed, and used to implement three different example
instructions which currently would be regarded as being at level 1 or level 2.
In particular, the SORT operation portrayed in Fig. 1-1 is Example 1, a
sifting or "bubble" sort algorithm. In this case, we have jumped over two of
the levels and have implemented directly in microcode an operation which has
a relatively large amount of processing to do and yet is a single conceptual
unit in the original problem statement. This type of level-jumping has been
discussed in previous work on "direct execution" of high-level languages, in
which the compiling step of the transformation from level 2 to level 3 is
removed; and in the extension to direct microcode implementations of high-
level constructs.
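For reference, the sequential meaning of the Example 1 SORT operation - a sifting ("bubble") sort - fits in a few lines of a high-level language; Python is used here for compactness. On the Z-machine this entire routine is one "machine-language" instruction realized as a microprogram (the multiple-processor implementation itself is described in Chapter 3 and Appendix B; the code below shows only the functional effect of the operation):

    # Sequential meaning of the SORT primitive: a sifting ("bubble") sort.
    def sift_sort(table):
        n = len(table)
        for i in range(n - 1):
            for j in range(n - 1 - i):
                if table[j] > table[j + 1]:          # out of order: exchange
                    table[j], table[j + 1] = table[j + 1], table[j]
        return table

    assert sift_sort([3, 1, 2]) == [1, 2, 3]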
The example operations shown in this work should be considered as part
of a large set of high-level primitive operations which would make up a
“machine language”. The algorithmic problem description, possibly at level 1,
would still require compiling into sequences of these primitive operations.
However, the more closely the primitive operations resemble the high-level
conceptual units, the easier it will be to generate compact representations of
these units in low-level code.
A stronger argument for implementing high-level constructs in microcode
is that execution time can be saved. This time saving is achieved at the
expense of some flexibility available with simpler processing primitives.
Another saving, however, is the removal or reduction in complexity of the
programming task at the next level up. The human activity of programming
is one of the largest components of cost in performing a computer-based task.
Any savings in the often-repeated job of high-level language coding can be
applied against the relatively infrequent, but now more sophisticated task of
microcode generation.
The increase in the difficulty of the microprogrammer's task, a con-
sequence of "elevating" the machine language, is not to be skipped over lightly.
The microprogrammer has responsibility for properly controlling independent
functional units of the hardware, as well as producing an algorithm which is
guaranteed to execute correctly for all possible timing and data conditions (or
at least: to execute correctly or detect errors in an orderly way for all
possible conditions). If it becomes impossible to find people who can write
such microcode for the high-level constructs chosen, the viability of the
machine organization proposed here - applied to such high-level operations -
is doubtful. However, we contend that a small number of microprogrammers,
well versed in both hardware control and algorithm design, can develop all the
primitive operations which may be required for special or general-purpose
computation. They will only require careful specification of the set of primitive
operations, and a proper set of criteria for success or cost measures.
The great inhibition to development of higher-level primitive operations,
other than the inertia of tradition, has been our inability to answer the question:
what are the primitive operations appropriate to the problem? Even when the
answer to this question is known, the answer is highly specific to the problem.
Therefore, a set of low-level operations with which one can “do anything” has
been the solution to both general-purpose and special-purpose computing.
Now that read-write microcode storage allows us to consider modifying the
machine language of a fixed hardware organization, and the market for
single-application (special-purpose) computers is large enough, we are ready
to consider designing specialized sets of instructions tailored to the application.
The instructions which are the examples in this work - two sort instruc-
tions and a matrix multiply - were chosen not so much for being prototypes
of an inevitable future instruction repertoire as for demonstrating the level of
complexity achievable in reasonable lengths of microcode, using plausible
underlying hardware. The choice of algorithms for implementing them was,
of course, influenced by the desire to elucidate the strengths and limitations
of this hardware organization.
The Multiple-Processor Approach
In large computing systems, the demand for reliability and for high peak
computing capacity has led to implementation of a number of multiple-processor
configurations. The cost of the relatively large processors in these systems
is typically justified by either the even larger investment in peripheral devices
or by a very high premium on peak capacity. The economics of computer
electronics is now making it possible to consider implementing relatively
small-capacity computer systems with multiple-processors. In particular,
if the unit of “processor resource” is implementable on a few integrated-
circuit chips, the cost per processor (if they are identical) may become a very
small percentage of the total (hardware) system cost. (This economic consid-
eration is explored further in Chapter 4.) Because of this, computer manu-
facturers and others are beginning to propose systems composed of multiple
"mini-computers" (Dorff [44], Mutzeneek [47], Stern [71]). The research
described here is intended to shed light on the questions which arise when a
computer system contains a number of identical processors, each with rela-
tively small computing capacity. Some of these questions are:
1. How can the computing power of the processors be combined to
speed up the execution of a given task?
2. By what mechanism are the processors to be allocated to a task,
and among competing tasks?
3. How are the processors to communicate with each other?
4. To what extent is the multiplicity of processors to be reflected
in the programming language seen by the “user”?
In this study, we have answered "not at all" to this last question, choosing
to isolate the parallelism within one programming level. Thus, a sort oper-
ation which may invoke more than one processor is seen by the user as a
primitive operation and a functional description of this operation would not
have to refer to its multiple-processor implementation method. The novelty
of the programming presented here is that the execution does not depend on
the availability of any specific number of processors.
In order to simplify the programming, a restriction on the communications
between processors has been imposed; this restriction also simplifies the
allocation of processors. Figure l-2 shows the functional connections which
may be made by an executing processor. Not shown are the allocator mecha-
nism, which creates the connection to a “slave” processor, or the micropro-
gram store, which provides an instruction stream to each executing processor.
A complete description of the Z-machine (as this model is called) is given in
Chapter 2.
By restricting the interconnections between processors to this linear
"unary tree", as it might be called, the programming of a task is constrained
-9-
t + PM
A MAIN
MEMORY v
4 + P
A
“MASTER”
“SLAVE”
2106AZZ
FIG. l-2. Programmer’s view of processor connections.
in several ways. First, the symmetry of the chain of processors creates an
incentive to write re-entrant code, causing each processor to behave like its
neighbor; this also, incidentally, makes debugging easier. Second, the
instruction set of each processor need contain only three different input source
and output destination references (four, if the allocator is included). This
allows relatively simple testing and state-keeping during execution; it also
simplifies the hardware design. Finally, a processor may execute a detach
instruction to remove itself from the chain it is in. In this way, completely
independent processing is possible. This mechanism was not, however, used in
any of the three examples here.
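The chain discipline can be sketched in executable form. In the Python sketch below, threads stand in for processors, and the routine names, message scheme, and summation task are invented for illustration (they are not taken from the Z-machine examples); each "processor" communicates only with its master and with the single slave it may FORK, and the same re-entrant routine runs in every processor:

    # Sketch of the chained-processor discipline: one re-entrant routine,
    # each processor talking only to its master and at most one slave.
    import threading, queue

    def processor(to_master, values, forks_left):
        if forks_left > 0 and len(values) > 1:    # FORK: extend the chain
            from_slave = queue.Queue()
            slave = threading.Thread(
                target=processor,
                args=(from_slave, values[1:], forks_left - 1))
            slave.start()
            partial = from_slave.get()            # JOIN: await slave's result
            slave.join()
        else:
            partial = sum(values[1:])             # no slave: finish locally
        to_master.put(values[0] + partial)        # report to the master

    result = queue.Queue()
    threading.Thread(target=processor, args=(result, [1, 2, 3, 4], 14)).start()
    print(result.get())   # 10; four processors form the chain here (15 allowed)

Note that the routine runs correctly whatever value of forks_left it is given, which is the sense in which execution does not depend on the availability of any specific number of processors.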
The central thrust of this work is to provide a demonstration of one
answer to question 1: how to combine the computing power of many small
processors. By restricting the hardware and software contexts to a particular
configuration with a specified set of instructions, we have been able to con-
struct an operating model of a plausible multiple-processor computing system.
This system (the “Z-machine”) has been implemented, with a maximum of
15 processors, using the CISCO simulator (see Appendix A). Three example
algorithms have been coded and run on the simulated Z-machine. By varying
the maximum number of processors in the system and gathering data on the
run time and many other measures, we have been able to answer the following
questions within the context of the Z-machine:
1. Can multiple-processor algorithms be coded in a reasonable length
of (microprogram) memory?
2. How does the run time of a problem change with increasing numbers
of available processors?
3. Is there a practical limit on the number of processors which can
be used? In what way does the limit depend on the nature of the
algorithm?
4. How do we determine the optimal number of processors, based
on their cost? (A toy version of this calculation is sketched below.)
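Question 4 admits a simple back-of-the-envelope treatment. The economic model actually used is developed in Chapter 4; the Python sketch below uses an invented run-time curve (not measured Z-machine data) only to show the shape of the calculation: system cost grows linearly with the number of processors, and we pick the processor count minimizing cost per throughput, i.e., cost times run time.

    # Toy cost-per-throughput minimization.  The run-time function is a
    # hypothetical stand-in; Chapter 4 uses measured run times instead.
    def cost(p, base=1.0, proc_cost=0.1):
        return base + proc_cost * p        # a processor costs 1/10 of base

    def run_time(p, serial_fraction=0.2):
        return serial_fraction + (1.0 - serial_fraction) / p

    def best_processor_count(max_p=15):
        return min(range(1, max_p + 1), key=lambda p: cost(p) * run_time(p))

    print(best_processor_count())          # 6 for these invented parameters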
The example algorithms and their implementations are described in
Chapter 3. The measurement parameters and the major results are pre-
sented in Chapter 4. In addition to the questions above, the measurements
have led to conclusions regarding the performance tradeoffs of the hardware
organization. The results in Chapter 4 give answers to the following questions:
1. Is a common data bus a viable means of communication among
processors and memory?
2. What bus capacity (bandwidth) is necessary to avoid serious
bottleneck phenomena? (A toy saturation model is sketched after this list.)
3. What happens to efficiency, run time, and parallelism as the
bus becomes a bottleneck?
4. What happens to efficiency, run time, and parallelism as main
memory speed varies?
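For question 2, a crude saturation argument already suggests what to expect. If each active processor wants the bus for some fraction of its cycles, the bus keeps up only until aggregate demand reaches one transaction per bus cycle; beyond that point, added processors stall. The demand figure in this Python sketch is invented, not a measured Z-machine value; the measured behavior is reported in Chapter 4.

    # Toy bus-saturation model: effective parallelism flattens once the
    # processors' aggregate bus demand reaches the bus's capacity.
    def effective_parallelism(p, bus_demand_per_proc=0.25):
        # bus_demand_per_proc: fraction of a processor's cycles that
        # need a bus transaction (invented value).
        return min(p, 1.0 / bus_demand_per_proc)

    for p in (1, 2, 4, 8, 15):
        print(p, effective_parallelism(p))   # flattens at 4 in this example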
Literature Survey and Annotated Bibliography
In this section we review the origins of some multiple-processor concepts
and attempt to identify those thoughts in prior published work which point
toward asynchronous low-level parallelism.
We begin by citing separately a body of work which begins with the well-
known paper by Aschenbrenner, Flynn, and Robinson [3] and ends - in this
survey only - with an article by Flynn and Podvin [14]. Professor Flynn
has done much to classify the various approaches to computer organization;
his design preferences have, however, served more as a stimulant than as a
model for our work.
We include in this group a series of papers on the IBM System/360
Model 91 which, with its asynchronous arithmetic units and common data bus
in the floating-point units, has also stimulated our interest in multiple pro-
cessor systems. The Model 91 also has a special significance; we used slightly
over 23 hours of CPU execution time on the Model 91 at the Stanford Linear
Accelerator Center in generating the simulation results presented here.
We also include here two papers by Chen [6, 7] in which he has defined
several of the concepts we have used, such as the "space-time" of a job,
which we call "area".
1. Anderson, S. F., Earle, J. G., Goldschmidt, R. E., Powers, D. M.,
The IBM System/360 Model 91: floating-point execution unit, IBM J.
R&D (Jan. 1967), 34-53.
Pipelined, independent execution units ". . . provide
concurrent instruction execution at the burst rate of one
instruction per cycle."
2. Anderson, D. W., Sparacio, F. J., Tomasulo, R. M., The IBM System/
360 Model 91: machine philosophy and instruction-handling, IBM J. R&D
(Jan. 1967), 8-24.
3. Aschenbrenner, R. A., Flynn, M. J., Robinson, G. A., Intrinsic
multiprocessing, SJCC (1967), 81-86.
Fairly detailed description of multiple-processor effect
by sharing arithmetic and other functional units among many
pseudo-processors (instruction-decode streams). Brief
description of executive functions. Very centralized execu-
tive strategy. Passing mention of intra-program parallelism.
4. Boland, L. J., Granito, G. D., Marcotte, A. U., Messina, B. U.,
Smith, J. W., The IBM System/360 Model 91: storage system,
IBM J. R&D (Jan. 1967), 54-68.
5. Chen, T. C., The overlap design of the IBM System/360 Model 92
Central Processing Unit, FJCC (1964), Part II, 73-80.
Early discussion of Model 91 (to be) organization.
6. Chen, T. C., Unconventional superspeed computer systems, SJCC
(1971), 365-371.
'Job profile', 'space-time' and 'parallelism ratio'
definitions; 'micro-multiprogramming': "the workload
characteristic to be exploited is job independence, rather
than the much more restricted job parallelism."
7. Chen, T. C., Parallelism, pipelining, and computer efficiency, Computer
Design 10, 1 (Jan. 1971), 69-74.
Includes 'micro-multiprogramming,' automatic smoothing
of low-level congestion, such as in a cache.
8. Cook, R. W. and Flynn, M. J., System design of a dynamic micro-
processor, IEEE Trans. C-19 (Mar. 1970), 213-222.
“DMP” machine design - an extremely flexible pro-
cessor with writable micro memory. Microinstruction
sequencing determined in each microinstruction; examples
of table search, division, square root; comparison with
System/360. Assembler language for microprograms
defined.
9. Flynn, M. J., Very high speed computing systems, Proc. IEEE 54, 12
(Dec. 1966), 1901-1909.
Classifies computer systems with respect to single/
multiple instruction and data stream, and discusses known
techniques for augmenting the performance of these systems
and their limitations. Defines "bandwidth", "latency",
"confluence". ". . . a restricted MIMD (multi-instruction
stream, multi-data stream) may be organizationally much
simpler than a confluent SISD".
10. Flynn, M. J., Organization of computing systems, Notes from Computer
and Program Organization, University of Michigan Engineering Summer
Conference, 1970.
11. Flynn, M. J., Shared internal resources in a multiprocessor, IFIP 71,
TA-4, 7-11.
Improved usage of hardware resources by having multiple
"skeleton" processors (interpreters) in time-divided rings.
Conditional fork ("as available") suggested. "Sub-commutation"
described to recover otherwise unused time slots. Rotating
priority among rings guarantees minimum access to physical
resources.
12. Flynn, M. J., MacLaren, M. D., Micro-programming revisited, ACM
NC (1967), 457-464.
A broad discussion of the whys and hows of micro-
programming. Good reading.
13. Flynn, M. J., Low, P. R., The IBM System/360 Model 91: some remarks
on system development, IBM J. R&D (Jan. 1967), 2-7.
14. Flynn, M. J., Podvin, A., Shared resource multiprocessing, Computer
(Mar/Apr 1972), 20-28.
Shared functional units include multiple instruction
decoding units.
15. Flynn, M. J., Podvin, A., Shimizu, K., A multiple instruction stream
processor with shared resources, in "Parallel processor systems,
technologies, and applications," ed. by L. C. Hobbs et al., Spartan
Books (1970).
16. Schwartz, J. , Large parallel computers, JACM 13, 1 (Jan. 1966), 25-32.
“Athene” type, multiple instruction location counter
machines.
17. Senzig, D. N., Observations on high performance machines, FJCC (1967),
791-799.
Shared functional units among multiple CPUs.
18. Tjaden, G. S. , Flynn, M. J. , Detection and parallel execution of
independent instructions, IEEE Trans. C-19, 10 (Oct. 1970), 889-895.
19. Tomasulo, R. M., An efficient algorithm for exploiting multiple arith-
metic units, IBM J. R&D (Jan. 1967), 25-33.
Common data bus in the floating-point unit.
20. Tucker, A. B., Flynn, M. J., Dynamic microprogramming: processor
organization and programming, CACM 14, 4 (Apr. 1971), 240-250.
"A dynamically microprogrammed processor is charac-
terized by a small (4k 64-bit word) read-write microstorage."
(Abstract) 50 nsec control, 500 nsec main fetch times.
3 examples of microcode: integer multiply, linear table search,
Fibonacci number generation. "These operations are primitive
in the sense that executing a microinstruction requires no
sequential interpretation of an 'op code' by the execution unit."
"Microstorage words are assigned addresses of 0 through 4095
and main storage words are assigned addresses of 4096 and
higher." 16-deep microsubroutines. ". . . judicious selection
and efficient microprogramming of a problem oriented macro
set merit extensive research."
The next group of papers covers a variety of opinion concerning computer
organization. We find support of Flynn's shared functional units in Allen and
Pearcy [21]. But a contrary stream of development supports the idea of
having complete, independent processors in anonymous pools. Conway [22]
proposed a system in which "processors have no identity of their own".
Gountanis and Viss [27] similarly advanced a "multiprocessor system . . .
(which) achieves complete processor anonymity by eliminating any processor
specialization for arithmetic processing, input/output command, and interrupt
handling." Curtin [23] also argued that "each computer should contain all the
conventional equipment for instruction execution. It should be possible to
program various computers to provide a wide variety of functions which
precludes a built-in division of labor among the identical computers of a logical
or arithmetic nature."
We find the idea of multiple asynchronous processors in many places,
including Curtin [23], Lehman [29], Rosenfeld [33], and Rosenfeld and
Driscoll [34]; these articles inspired us to undertake the simulation-by-
instruction-execution approach. As Lehman [29] said, "It is the insight and
understanding gained from the design of simulation experiments and the analysis
of their results that draws attention to specific details and difficulties."
The coupling of multiple processors with microprogramming appears in
Davis and Zucker [25], Lesser [31], and Stevens [35]. This leads us to ask
at what level of programming we should apply the processors. Gosden [26]
assumed that "programmers will specify parallelism only if it is easy and
straightforward to do so." Allen and Pearcy [21] said, "the tendency is for
programmers to avoid functions which are not easily grasped . . . . A pos-
sible solution is . . . direct implementation of many functions needed in higher
level languages." Curtin [23] adds, "For multiple computers, the success
or failure of a hardware organization may rest solely upon the ease of
programming. . . ."
For these reasons, we have chosen to try to elevate machine language
operations to form easily used constructs. Davies' [24] broad review of
microprogramming summarizes, "It is fairly clear that, whatever the opti-
mization [of instruction repertoire] methods might be, they should not
require that users actually microprogram nor even understand micropro-
gramming. Implementation should be through disciplined and well-controlled
combinations of architecture, language processor, and operating system
services, with the user's interface as straightforward and as far from the
detailed microcode level as possible . . . ." In particular, we do not look for
user-defined parallelism to give us a breakthrough in machine parallelism; as
Gosden claimed, for example, the FORK-JOIN constructs in high-level languages
"(do) not show great gains."
Regularity of structure in the hardware, from LSI considerations, is
very desirable. Allen and Pearcy [21] argue well for this, and claim, "A
desirable aim is to use the increase in logic content to reduce software
complexity as well as improving execution speed." Correlated with regularity
of structure is a need to keep the control from becoming too complex.
Gosden [26] says, "A simple control scheme is needed to provide straight-
forward scheduling." And further, "A good control scheme . . . must be simple
so that the overheads of implementing it do not seriously erode the advantages
gained. Second, it should be efficient with simple ad hoc scheduling algo-
rithms." Low overhead means fast allocation. Gunderson, Heimerdinger,
and Francis [28] presented a system in which ". . . the capability of per-
forming allocation and assignment functions rapidly is used to provide
increased flexibility by making the duration of processor assignments (the
length of segments) relatively short so that processors are frequently avail-
able for reassignment."
Amortization of allocation overhead over the length of the parallel paths
can become important. As Gosden [26] said, "major benefits will accrue only
if . . . either many parallel paths exist or, if the parallel paths are few, they
must be long." Conway's [22] system also envisioned possible long paths:
"In fact, external interrupts are unnecessary. Consider that an I/O instruc-
tion is simply a very lengthy one which may be executed in parallel with other
code . . . ."
On the other hand, Conway also leads us to the idea of multiple processors
working on one task: ". . . notice that the FORK-JOIN approach provides no
justification for distinguishing between parallelism within a program and
parallelism between programs." Conway was not talking here about micro-
code, but the concept still applies. The whole approach of Lehman [29] is of
multiple processors on one task. Gosden [26] also pointed out "savings in
internal system overheads when parallel activity on one task replaces parallel
activity on different tasks."
We have gone in the direction of providing our system with many processors
of low computational capacity. Mallach [32], in a guidance computer study,
said, ". . . it was found that a multiprocessor with many slow processors is
more efficient than one with a few fast processors." Gunderson, Heimerdinger,
and Francis [28] gave the converse argument: "In fact, the need for parallel
processing is more important in a system with relatively weak processing
modules in order that it be able to execute a high-priority program in a rea-
sonable length of time."
The only previous application we have found of a variable number of
processors to a single problem is in Lehman [29] and Rosenfeld [33].
Lehman argued, "Within a system in which a maximum of p processors are
available to a job, it is pointless to partition a job, at any one time, into more
than p tasks. It is, however, undesirable to guarantee a user that p pro-
cessors, or even more than one processor, will execute his program."
We also find in Gosden [26], "Processes generate parallel paths dynami-
cally one at a time. . . . Distribution of parallel paths among varying num-
bers of processors is an easy ad hoc process." Finally, shared code among
processes is mentioned by Conway [22]: ". . . the same section of code can
be executed simultaneously in two parallel paths."
21. Allen, M. W., Pearcy, T. , Developments in machine architecture,
Proc. Fourth Australian Computer Conference (1969), 227-230.
Good general motivation for multiple-microprocessor
approach, starting from LSI considerations. 24 references.
22. Conway, M. E., A multiprocessor computer design, FJCC (1963),
139-146.
23. Curtin, W. A., Multiple computer systems, in Alt, F. L. and Rubinoff, M.
(eds.), Advances in Computers, Vol. 4, Academic Press (1963), 245-302.
A good introduction to and motivation for multiple identical-
processor systems. States some criteria of "success", also
refines the definition of "multiple computer". Curtin's multiple
processor model does not provide for simple processor-
processor transfers, and the programming seems to bog
down due to this. Contains a programming example, plus
a survey of existing (in 1963) multiple computer systems.
Considerable explanation of possible bus structures.
Sample of excellent "motivation" section: "if a multiple
computer is worth building at all, it should at least solve
some class of problems within some specified time limit,
which limit would be exceeded by any single computer working
on the same task. Also . . . it should, at least some of the
time, be used to execute a single task which requires all or
most of the available equipment." "The question to what
extent a multiple computer utilizes its hardware more
efficiently is, however, secondary to the time factor."
24. Davies, P. M., Readings in microprogramming, IBM Systems J. 11,
1 (1972), 16-40.
An enlightening and clearly written tutorial, with
annotated references to all the essential articles through
1970. "The fact that a machine is running at maximum
storage data bandwidth does not necessarily imply that a
particular programmed task is being executed at the maxi-
mum rate that could be achieved . . . (varying) the organi-
zation and representation of . . . (the program) and its
associated data." (p. 27)
25. Davis, R. L., Zucker, S., Structure of a multiprocessor using micro-
programmable building blocks, SIGMICRO Newsletter (ACM) (Oct. 1971),
27-42.
Burroughs "Interpreter" described in fair detail, with
claims for multiple processor configurations. "The Switch
Interlock is the means for interconnecting virtually any
desired number of Interpreters into a unified system. . . ."
The microprogramming design techniques are very interesting.
26. Gosden, J. A., Explicit parallel processing description and control in
programs for multi- and uni-processor computers, FJCC 66, 651-660.
Good survey of the approaches to parallel processing.
Goes into most detail for the FORK-JOIN, and parallel
FOR constructs for high-level languages. Emphasis, in
this part on sharing re-entrant code.
27. Gountanis, R. J. and Viss, N. L., A method of processor selection
for interrupt handling in a multiprocessor system, Proc. IEEE 54, 12
(Dec. 1966), 1812-1819.
". . . the multiprocessor system is based on a pool of
anonymous processors." "The supervisory or executive
program is maintained in the shared storage (main memory)
and is executed by any or all processors as required within
the system." "In order for a system to degrade gracefully,
either its various units must be duplicated, . . . or the units
must be failproof. All three approaches are represented
in the system described here."
28. Gunderson, D. C., Heimerdinger, W. L., Francis, J. P., A multi-
processor with associative control, in "Prospects for simulation and
simulators of dynamic systems," 183-200 (see CR 13916), Shapiro, G.
and Rogers, M. (eds.), Spartan Books (1967).
(Dynamic control over processor assignments.)
Multiple processors in a conventional organization except
for the use of associative memories to aid in allocation of
processors, segmentation, etc. Gives argument for short
segments, causing frequent reallocation of processors.
29. Lehman, M., A survey of problems and preliminary results concerning
parallel processing and parallel processors, Proc. IEEE (Dec. 1966),
1889-1901.
“After an introduction which discusses the significance
of a trend to the design of parallel processing systems, the
paper describes some of the results obtained to date in a
project which aims to develop and evaluate a unified hardware-
software parallel processing computing system and the
techniques for its use.” (abstract)
30. Lehman, M., Serial mode operation and high-speed parallel processing,
IFIP (1965), Part 2, Spartan Books (Oct. 1966), 631-632.
"The relative efficiency of a multiprocessor operating
in a serial mode and a conventional processor operating in
parallel depends critically on the extent to which macro-
parallelism (multiprocessing) is achievable in the former and
microparallelism (instruction look-ahead) in the latter.
Only the accumulation of significant operating experience
can decide these questions."
31. Lesser, V. R., An introduction to the direct emulation of control
structures by a parallel microcomputer, IEEE Trans. C-20, 7
(Jul. 1971), 751-764.
A very general model of emulation and instruction
execution is developed within a context of multiple asyn-
chronous microprocessors as the underlying support.
32. Mallach, E. G., Analysis of a multiprocessor guidance computer,
Report T-515, Instrumentation Laboratory, M. I. T., Cambridge, Mass.
(June 1969) (thesis).
33. Rosenfeld, J. L., A case study in programming for parallel-processors,
CACM 12, 12 (Dec. 1969), 645-655.
Exhibits application of multiple processors using a
common memory to a network calculation (using a "chaotic
relaxation" method). Basic processor modeled is an abbre-
viated System/360 with some additional instructions: Fork
("Split"), Join ("Terminate"), Test and Set, Branch on Index
in Storage (cf. Dijkstra's V(x)). Interesting results are
presented on (1) problem run time, (2) storage interference,
(3) processor and storage utilization, and (4) a run-time-
related "quality factor". The work reported is carefully
delineated and the paper contains cogent comments relevant
to more general problems.
34. Rosenfeld, J. L., Driscoll, G. C. , Solution of the Dirichlet problem on
a simulated parallel processing system, IFIP (1968), Software 2,
Booklet C, 24-28.
"Four relaxation algorithms were programmed . . . on
an instruction-executing simulator for a parallel processing
system."
35. Stevens, S. L., A multiprocessed, microprogrammed processor con-
figuration with engineered processor pools, Technical report, Bell
Laboratories, Naperville, Illinois (1971), 5 pp.
A rather general proposal advocating both micropro-
gramming and multiple processor configurations.
The next group of papers is somewhat peripheral to this work. They
have to do with higher-level language implementation, parallelism in high-
level languages, and language constructs for the control of parallel processors.
36. Bjorner, D., Review 22,362, of "SYMBOL - a large experimental sys-
tem exploring major hardware replacement of software," by
W. R. Smith et al., SJCC (1971), 601-616, Computing Reviews
(Dec. 1971), 580-581.
". . . to get at high-level language machines one may
do better by implementing . . . the primitives best suited
for compile-time language translation and execution-time
language interpretation."
37. Firestone, R. M., Parallel programming: operational model and de-
tection of parallelism, Ph.D. thesis, New York University (May 1971).
". . . attention is focused on multiprocessor hardware
and programs which have portions executed concurrently. . . .
The degradation due to storage interference is analyzed . . .
in the context of a block-structured language with parallel
execution."
38. Iliffe, J. K., Basic machine principles, American Elsevier Publishing
Co. (1968), 86 pp.
39. Meggitt, J. E. , A character computer for high-level language inter-
pretation, IBM Systems J. 3, 1 (1964), 68-78.
"The design was simulated on standard equipment."
40. Opler, A., Procedure oriented language statements to facilitate parallel
processing, CACM 8 (May 1965), 306-307.
Demonstrates a plausible approach to parallelism in
high-level language source code - also shows the restric-
tions necessary for implementation.
41. Walden, D. C., A system for interprocess communication in a resource
sharing computer network, Proc. 4th Hawaii International Conference on
System Sciences (1971), 640-642.
Short discussion of generalized communication
primitives; possibly applicable to one-site multi-
processor systems.
The group of papers next below shows some of the applications of multiple-
processor systems.
42. Bauer, W. F., Why multi-computers?, Datamation 6 (Sep. 1962), 51-55.
Early discussion of the motivation for multiple pro-
cessor systems.
43. Bell, G., Freeman, P., et al., C.ai: A computing environment for
AI research - overview, PMS and operating system considerations,
Department of Computer Science, Carnegie-Mellon University, Pittsburgh,
Pa. (May 1971).
Proposal for a large, parallel machine for research
in artificial intelligence.
44. Dorff, E. K., A multiple minicomputer message switching system,
Computer Design 11, 4 (Apr. 1972), 67-73.
45. Langley, F. J., The universal function unit concept for computing
applications, Computer Design 11, 4 (Apr. 1972), 87-91.
"Single bus multicomputer system," Fig. 4, p. 89.
46. Miller, W. F., Aschenbrenner, R., The GUS multicomputer system,
IEEE Trans. EC-12 (Dec. 1963), 671-676.
“As many as seven independent computers or devices
are allowed to operate on the memories. . . . ” Application
to graphics, pattern recognition research.
47. Mutzeneek, J. I., Using mini-computers in systems engineering,
Electronic Engineering (Feb. 1971), 45-47.
“Separate processing units . . . under the control of
another mini-computer. ”
48. Rohrbacher, D. L. , Advanced computer organization study, final
report, Report RADC-TR-66-7, Vols. I and II, Rome Air Development
Center (Apr. 1966), AD 631870 (Vol. I), AD 631871 (Vol. II).
High-level investigation, with specific machines
designed for sorting and parallel compilation. Involved
in the project: J. Holland, K. E. Batcher, C. C. Foster,
as well as the author cited.
49. Wald, B. , Utilization of a multiprocessor in command and control,
Proc. IEEE 54, 12 (Dec. 1966), 1885-1888.
General discussion of the D-825 and some imple-
mentation problems and approaches.
The books cited below are good references for computer design in
general.
50. Bell, C. G., Newell, A., Computer structures: readings and examples,
McGraw-Hill (1971), 668 pp.
51. Buchholz, W. (ed.), Planning a computer system - project STRETCH,
McGraw-Hill (1962), 322 pp.
52. Husson, S. S., Microprogramming principles and practices, Prentice-
Hall (1970).
53. Lorin, H., Parallelism in hardware and software - real and apparent
concurrency, Prentice-Hall, Inc. (1972), 508 pp.
54. Sharpe, W. F., The economics of computers, Columbia University
Press (1969).
55. Thornton, J. E., Design of a computer - the Control Data 6600,
Scott, Foresman and Co. (1970), 181 pp.
A terse historical, motivational, organizational and
functional description of the 6600 by the man who "was
personally responsible for most of the detailed design."
Includes some circuit and logic design, especially of
arithmetic units.
The remaining group of papers, below, covers a variety of topics related
to multiple processor computer systems.
56. Baskin, H. B., Horowitz, E. B., Tennison, R. D., and Rittenhouse,
L. E., A modular computer sharing system, CACM 12, 10 (Oct. 1969),
551-559.
". . . a general purpose interactive multi-terminal
computing system is presented. . . . a bank of interchangeable
computers . . . are assigned to process terminal jobs as
they arrive." (abstract) Uses a crossbar switch, master
(control) processor.
57. Buchman, A. S., Aerospace computers, in Alt, F. L., and Rubinoff, M.
(eds.), Advances in Computers, Vol. 9, Academic Press (1968), 234-284.
Emphasis on reliability; error recovery; packaging.
58. Burnett, G. J., Koczela, L. J., Hokom, R. A., A distributed processing
system for general purpose computing, FJCC 67, 757-768.
Ideas for arrays of processors, a la Solomon, are
partially developed. "Natural" and "applied" parallelism.
Develops some approximate curves relating storage per
cell to parallelism and efficiency vs. parallelism.
59. Codd, E. F., Multiprogramming, in Alt, F. and Rubinoff, M. (eds.),
Advances in Computers, Vol. 3, Academic Press (1962), 77-153.
Multiple processing units, at high level.
60. Duggan, P. N., Multiple instruction stream microprocessor organization,
IBM Technical Disclosure Bulletin 14, 9 (Feb. 1972), 2739-2740.
"An instruction pipelining technique provides for execu-
tion of micro-instructions simultaneously from control
storages of four logical micro-processors. . . ."
61. Kartsev, M. A., On the structure of multiprocessor systems, IFIP 71,
Booklet TA-4 ("Hardware and Systems"), 1-6.
Classifies 4 types of multiprocessor systems: (1) asyn-
chronous, (2) synchronous with separate control, (3) common
control but separate instructions, (4) common instructions.
Proposes a combination of all four, using "bundles" of closely
connected, synchronous processors communicating
62. Koczela, L. J., The distributed processor organization, in Advances in
Computers, Vol. 9 (1968), 285-353.
Elaborate system of processors (in groups) which can
operate in concert or independently, at many levels of hier-
archy. Much attention paid to reconfigurability, I/O
structure, transition among operating modes. Intermodule
connections are described in detail. No example programs
are shown in this paper. It appears that the software may
be very complex.
63. Leiner, A. L., Notz, W. A., Smith, J. L., Marimont, R. B., Con-
currently operating computer systems, Proc. Int'l Conf. on Information
Processing, UNESCO, Paris, 15-20 June 1959 (R. Oldenbourg, München,
1960), 353-361.
Three-processor NBS "PILOT" system described.
Specialized processors, messages between them, handling
of 2 level storage system explained in considerable detail.
Some Boolean matrix techniques for detecting program step
dependencies explained.
64. Macnaughton, P. C. , Virtual multiprocessors, CIPS Magazine, 12-14,
20-24.
Classifies "multisystems"; primary objectives are
self-repair, reliability, expandability, modularity.
65. Macnaughton, P. C. , Lee, E. S. , The virtual multiprocessor - a new
computer architecture, IFIP (1968), Hardware 2, Booklet E, 39-43.
“Task managers, execution units, communication units,
and assignment units” are the four module types.
66. Miller, J. S., Lickly, D. J., Kosmala, A. L. , Saponaro, J. A. ,
Multiprocessor computer system study, final report (NASA-CR-108654)
(March 1970), 165 pp.
Excellent survey of 29 multiple processor computers.
Design study with emphasis on graceful degradation and
reliability for spacecraft application. Good review of
computer architectural considerations, in general and
specifically for reliability. Memory allocation, interrupts,
stacks, microprogramming all discussed in review (Ch. 3).
High-performance, reliable design schemata presented in
Ch. 5; uses buffer memory, microprograms in read-write
storage, stacks, relocation of memory pages; redundancy.
Appendix A presents survey of segmentation and paging on
7 existing machines.
67. Potvin, J. N., Chenevert, P., Smith, K. C., Boulton, P., Star-Ring:
a computer intercommunication and I/O system, IFIP 71, TA-4, 156-161.
Describes hardware used to interconnect several
computer systems and I/O devices, overcoming some
of the usual bus difficulties by constructing a high-speed
ring bus, with ports onto more normal busses. Cyclic
access to ring. A five port prototype was implemented.
68. Rice, R., LSI and computer system architecture, Computer Design 9,
12 (Dec. 1970), 57-64.
69. Riegel, E. W. , Parallelism exposure and exploitation in digital computing
systems, Report TR69-4, Burroughs Corp. (Apr. 1969), 255 pp. (thesis)
"Language constructs are suggested that provide explicit
indication of parallelism at the task level (routines and repeat
statements)." "Multiple computer systems are examined and
compared based on homogeneity and interunit communication."
Good survey and review.
70. Smith, P. A., Variable architecture computer, IBM Technical Dis-
closure Bulletin, 13, 9 (Feb. 1971), 2777-2778.
"A computer is illustrated having an architecture that
may be restructured by microprogramming." "The multi-
processing configuration . . . effectively (divides) the system
into two separate processors. . . ." "There is no restriction
as to which main store supplies microinstructions to the
control registers."
71. Stern, L., PDP-11 parallel multiprocessor, DECUSCOPE 9, 5
(Dec. 1971), 7-8.
Suggests how to hook up multiple PDP-11s using the
Unibus as a common bus.
72. Thompson, R. N. and Wilkinson, J. A., The D825 automatic operating
and scheduling program, SJCC 63, 41-49.
Parallel processing of independent tasks.
73. Ware, W. W., The ultimate computer, IEEE Spectrum (March 1972),
84-91.
"Conceptually, a multistream organization looks
attractive because it executes concurrently more than
one stream of operations, but experience with such
machines is limited. Practically, present multistream
designs are constrained to the concurrent execution of
the same operation, but on several strings of operands."
(p. 89)
Abbreviations used in the Bibliography:
SJCC Proc. AFIPS Spring Joint Computer
Conference
FJCC Proc. AFIPS Fall Joint Computer
Conference
IBM J. R&D IBM Journal of Research and Development
IEEE Trans. C- IEEE Transactions on Computers,
Volume C-
IFIP nn Proc. IFIP Congress (19nn)
ACM NC Proc. ACM National Conference
JACM Journal of the ACM
CACM Communications of the ACM
". . . multiprocessing organization has to this point seemed to have had little effect upon the architecture of the processors that exist in the system."
Harold Lorin, "Parallelism in Hardware and Software" (1972)
CHAPTER 2
COMPUTER HARDWARE DESCRIPTION
The Multiple-Processor Model
The choice of a specific hardware environment for trying out a multiple-
processor approach to computing is, to a great extent, a matter of taste. We
feel strongly, however, that it is worthwhile to choose a particular config-
uration which will impose restrictions of the type which would be encountered
if the system were implemented physically. Some of these restrictions will
be found to have no effect on the main results; others have possibly limited the
scope of validity of the results.
Following are some of the prejudices and presumptions which led to the
selection of the configuration described in this chapter.
1. The economics of integrated circuits demands maximum repeti-
tion of few parts. All processors should therefore be truly
identical and interchangeable. (An arithmetic capability was
rejected on this basis. )
2. Serious consideration should be given to keeping the number of
pins (wiring connections) on a processor to a minimum. Having
chosen to place the microprogram store external to the pro-
cessor, this led to the choice of a very narrow microinstruction
width - 8 bits.
3. Related to 2, above, no great emphasis should be placed on
minimizing the number of gates in a processor.
4. Allocation overhead is presumed to be low. In attempting to
make reasonable use of multiple processors within the exe-
cution of presumed machine-language instructions, it is clear
that processors must be attached and detached quickly. The
chosen model is able to allocate a free processor in a time
equal to two microinstruction cycles (for typical parameter
settings).
5. Correlated to 4, above, we expect and desire to have long
sequences of microinstructions. In any tradeoff, for example,
between a greater number of pins (required for addressing
capability) and longer microprograms, we have leaned
strongly toward the latter.
6. The microinstructions which request allocation and which
perform interprocessor communication are to be simple and
direct. The desire for simplicity (in both hardware and
programming) led to the restriction of a maximum of only
one “slave” processor allocated to a processor. This
restriction also leads to a desirable simplicity in the allo-
cator mechanism.
7. Timing of microinstructions should be uniform, even though
the separate processors are not synchronous. This led to
the omission of multiply and divide instructions, and to the
decision to avoid trap and interrupt mechanisms.
8. Since all processors are identical, we should make them
“anonymous” so that no dependence on physical processor
number can arise. For this reason, physical processor
numbers are not made available to the microprogrammer.
9. The choice of microinstructions, other than interprocessor
communication and allocation primitives, is not extremely
important. An intuitively arrived-at set of instructions was
modified only slightly during the course of development.
(The shift amounts were changed, and a store register was
substituted for Exclusive OR.)
Note that the allocation mechanism used in this organization does not
depend on having a bus type of communication path. The allocator could as
easily reside within a crossbar switch arrangement. The main feature of
the allocator from the programming point of view is that a processor exe-
cuting a FORK sends a single message to “processor zero”; the allocator
fills in the number of a destination processor, if available, and generates an
acknowledgment message.
This organization is particularly applicable to computation on problems
which can be structured to have relatively little global interaction. For ex-
ample, it is applicable to problems in which a chain of processors can
operate on data in pipeline fashion. Example 1 has made a bubble sort al-
gorithm into such a program, which we call a “sifting” sort. Other problems
for which it would work well are those with generally independent activity.
Example 3, a matrix multiply, is one such problem. Others are: some file
search algorithms, relaxation methods for partial differential equation
solution, top-down parsing algorithms, some pattern-recognition methods,
and simulation programs which have parallel processes. Certain input/ output
operations and other computer operating system functions could be handled by
independent processors.
This organization is not especially well suited to problems which have
complex data interaction. For example, the Fast Fourier Transform algo-
rithm would invoke a great deal of coordination overhead through main memory
due to the limited connectivity of the processors. Example 2, the Shell sort,
encounters this kind of difficulty, some of which was resolved by using some
of the data locations for coordination semaphores. In general, this organi-
zation relies to a high degree on the skill of the microprogrammer in finding
an appropriate structuring of the data and control mechanisms needed.
Some particular limitations of the model described in this chapter are
the following:
A. The choice of a common data bus to interconnect the processors and
main memory does not affect the programming of the algorithms, but it does
have some other consequences.
1. By varying the speed of the common bus, we have been able
to study the effects of varying bus capacity, but since all main
memory communications are transmitted across the bus, it
was not possible to reduce the memory access time below two
bus cycle times. Thus, for example, it was not possible to
simulate the pure case of “infinite” memory speed coupled
with a slow bus cycle time.
2. The speed of the common bus priority mechanism, which
scans the memory modules and processors, limits the
overall capacity of the bus. Although the scanning
mechanism is practical with 15 processors and four
memory modules, it would not be satisfactory for, say,
100 processors.
B. This model does not reflect any of the interference in microprogram
memory access which would occur if a common memory were actually imple-
mented. This deliberate simplification has been made to keep the artifacts
of microprogram access from influencing the other design studies. We feel
that a practical, LSI-implemented system would probably have separate
microprogram storage for each processor; this would avoid the access inter-
ference, but introduce the problem of microcode duplication.
C. No special input/output processors or any provision for input/output
is developed in this model.
Plausible extensions of the model would be:
1. Addition of one or more special processors which would have
connections to external devices or channels.
2. Separate ports to main memory which would give access to
I/O channels. All communication would be through main
memory.
3. Addition of control units which would be activated by addressing
special memory locations, in the manner of the DEC Unibus
[ “PDP-11 Handbook, ” Digital Equipment Corporation (1969)] .
Description of the Z Machine
The multiple processor system configuration chosen for study is dia-
grammed in Fig. 2-1.
The Z-machine has been described and simulated in the CISCO language.
The description is divided into four functional blocks (called “processes”
in CISCO): A processor module, a data bus, a main memory module, and a
microprogram store. Four main memory modules and fifteen processor
modules are used in the studies reported here. The bus contains the processor
[Figure: four main memory modules (M), the allocator (A) on the data bus, the
processors, and the microprogram store.]
FIG. 2-1. Z-machine configuration.
allocation mechanism (shown as block “A” in Fig. 2-1) and also contains
most of the statistics-gathering instrumentation and trace-generating state-
ments. Each processor connects to the microprogram store and to one port
of the data bus. A single microprogram store serves all of the processors.
The main memory modules are used as a four-way interleaved memory -
i.e., the low-order two bits of address determine which module is addressed.
Each of these functional blocks is described in further detail below.
The Processor
Each processor interprets microinstructions fetched from the micro-
program store, one at a time. A separate connection between each processor
and the microprogram store allows a processor to fetch a microinstruction
on each cycle of the store. Communication among processors and between a
processor and main memory is done via the bus. Requests for allocation
or deallocation of processors are also passed to the bus, which contains the
allocation mechanisms.
A local register named STATUS contains a value which indicates the
execution state of a processor:
STATUS = 0   The processor is unallocated and is waiting for a startup
             message from the allocator (“STOP” state).
STATUS = 1   The processor is waiting for permission to transmit to
             the bus (“BUS-WAIT” state).
STATUS = 2   Microinstructions are being actively interpreted
             (“RUN” state).
STATUS = 3   A “wait for input” instruction has been executed and input
             has not yet been received (“WAIT” state).
Logical interconnections between processors are made only by use of the
FORK instruction. When a FORK is executed by a processor, a request is
output to the allocator (transmitted on the bus). The request contains a starting
microprogram address. If an unallocated processor exists, it is allocated to
the requesting processor as its “slave”. The allocator sends a startup message
to the slave processor containing the starting address; then it sends an acknowl-
edgment message to the requesting processor (the “master”).
Each processor contains two local registers, named MASTER and SLAVE.
When a processor is allocated, its MASTER register is filled with the index
number (“name”) of the processor to which it is slave. The master processor
receives the name of its slave in the acknowledgment message, and places it in
the SLAVE register. In case of failure to allocate (i.e., no processor available),
an immediate acknowledge message is sent to the requestor, and its SLAVE
register is set (remains) equal to zero. All allocation requests are acted upon
immediately (during one or two bus cycles). Requests which fail to receive a
slave are not remembered by the allocator.
The contents of the MASTER and SLAVE registers may be tested (for =O
or #O) under microprogram control, but their contents may not be otherwise
accessed. Thus the anonymity of individual processors is preserved.
A processor may have at most one slave allocated to it. Its slave may in
turn have a slave, but this condition cannot be directly tested by the master.
Bus transmissions from a processor may be sent only to its master (if any)
or to its slave (if any).
The following are the microinstructions which deal with allocation and
interprocessor communication.
mnemonic code    description

FORK    request allocation of slave (starting address is in the accumulator)
STOP    deallocate self and wait for startup
DET     detach self from master (if any) and slave (if any), but continue
        processing
TSL     test if SLAVE = 0 (set condition code)
TMA     test if MASTER = 0
OSL     output data (from accumulator) to slave
OSLF    output data to slave, with flag
OMA     output data to master
OMAF    output data to master, with flag
WIN     wait for data input (from master, slave, or memory)
TIN     test source of input (set condition code to indicate master, slave,
        or memory)
TIF     test whether flag was received on last input (set condition code)
Main memory access is accomplished by the following memory-cycle-
initiating microinstructions:
MEMH    start memory cycle - destructive (“half-”) read
MEMF    start memory cycle - full read (“fetch”) and rewrite
MEMS    start memory cycle - store
MEMM    start memory cycle - read-modify-write (read and lock)
Each of these microinstructions causes a message to be sent to the
appropriate memory bank via the bus, taking the memory address from the
accumulator. The Store microinstruction (MEMS) would be followed by the
microinstruction to send a data word to memory:
OMEM output data to memory
The two read microinstructions would normally be followed by a wait-for-
input (WIN) instruction. The read-modify-write microinstruction (MEMM)
would normally be followed by both a WIN and an OMEM instruction. For both
the MEMS and MEMM cycles, the selected memory bank will wait until a data
word is received.
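These request sequences can be summarized in a few lines. The sketch below
(Python, for exposition only; the machine itself is described in CISCO) models
them against a hypothetical bus object whose send and wait_input operations
stand in for the microinstructions named above.

    # Hedged sketch of the memory-access sequences; "bus" and its methods
    # are hypothetical stand-ins for the microinstructions, not simulator code.

    def read(bus, addr):
        bus.send("MEMF", addr)         # start a full-read cycle (MEMH for half-read)
        return bus.wait_input()        # WIN: wait for the data word from memory

    def store(bus, addr, word):
        bus.send("MEMS", addr)         # start a store cycle
        bus.send("OMEM", word)         # the selected bank waits for this data word

    def read_modify_write(bus, addr, modify):
        bus.send("MEMM", addr)         # read and lock
        old = bus.wait_input()         # WIN: receive the original contents
        bus.send("OMEM", modify(old))  # return the modified word to memory
        return old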
The destination memory bank (module) number is determined by the two
low order bits of the address. These two bits are copied into the register
named BANK whenever a memory cycle is initiated by the processor. When an
output data to memory (OMEM) microinstruction is executed, the module
number is indicated by the value in BANK.
Each processor contains an accumulator (called A) plus a block of eight
registers, named R(0) through R(7). R(0) is dedicated to use as the micro-
program counter. The following eight microinstructions operate on these
registers.
AR      Add R(r) to A, result to A
LR      Load A from R(r)
INCR    Add one to R(r)
EXR     Exchange A with R(r)
SR      Subtract R(r) from A, result to A
NR      AND R(r) into A
OR      OR R(r) into A
STR     Store A into R(r)
The last four of these microinstructions are not permitted to operate
on R(0).
Associated with the accumulator are an overflow bit and a link bit. The over-
flow is set to 1 whenever an addition or subtraction operation involving the ac-
cumulator produces a result which cannot be stored in 16 bits (in signed two's
complement representation). The link bit is used in certain shift operations.
Shifting may be performed left (SHL) or right (SHR) in each of four modes:
logical (LN), arithmetic (AN), circular (CN), and circular with link (CL). The
shift amount (number of bits) of a single microinstruction may be 1, 3, 8, or
the number stored in R(1). A typical shift instruction, in assembler format,
would be
SHR LN,(1)     Shift right, logical [no link], one bit
The following miscellaneous microinstructions operate on A and the over-
flow and link bits :
INCA    add 1 to A
DECA    subtract 1 from A
-A      complement A
ALNK    add link bit to A
LNK0    set link to zero
LNK1    set link to one
-LNK    complement link
AOV     add overflow bit to A
The test microinstruction sets the condition code to a value depending on the
contents of the accumulator; the condition code is tested in the conditional skip
microinstruction. These cause 1, 2, 3, or 4 following microinstructions to
be skipped if the tested condition is satisfied.
SKn OV    skip n if overflow = 1        SKn NO    skip n if overflow = 0
SKn GT    skip n if A > 0               SKn LE    skip n if A ≤ 0
SKn LT    skip n if A < 0               SKn GE    skip n if A ≥ 0
SKn EQ    skip n if A = 0               SKn NE    skip n if A ≠ 0
The condition code is also set by microinstructions TMA, TSL, and TIF.
The appropriate conditional skips following these are SKn EQ or SKn NE.
After a TIN (test source of input), the conditional skips are
SKn MA      skip n if input from master (GT)
SKn SL      skip n if input from slave (LT)
SKn MEM     skip n if input from memory (EQ)
SKn ¬MA     skip n if input not from master (LE)
SKn ¬SL     skip n if input not from slave (GE)
SKn ¬MEM    skip n if input not from memory (NE)
The correspondence between these and the previous set of conditional skips
is shown in parentheses.
Two “immediate” operand microinstructions are provided to allow intro-
duction of data or addresses from the microinstruction stream. These are:
LI i Load A with immediate value
AI j Append j to value in A
The operand i is six bits wide, composed of a sign and 5 bits of magnitude.
The sign bit is copied to the left to fill the upper 11 bits of A, including its
sign bit. The operand j is five bits wide. When AI is executed, A is shifted
left (logical, no link) five bits, and the value j is placed in the lower 5 bits of
A.
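As an illustration of these semantics, the following toy model (not part of the
simulator) builds a 10-bit value in the 16-bit accumulator with one LI followed
by one AI; treating the LI operand field as sign-extended two's complement is
our reading of the description above, and is an assumption.

    MASK16 = 0xFFFF

    def LI(i):
        # 6-bit immediate (sign plus 5 bits of magnitude), sign-extended to 16
        # bits; interpreted here as two's complement (an assumption)
        assert -31 <= i <= 31
        return i & MASK16

    def AI(a, j):
        # shift A left 5 (logical, no link) and insert the 5-bit value j
        assert 0 <= j <= 31
        return ((a << 5) | j) & MASK16

    a = LI(0b01001)       # upper 5 bits of the value
    a = AI(a, 0b00011)    # lower 5 bits; a == 0x0123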
An assembly language variant of LI occurs, called LAI, “load A with imme-
diate address”. This is the same as LI, but the assembler will flag any value
of i which is negative or exceeds 5 (rather than 6) bits in width.
Communication between the bus and a processor is generally independent of
the microinstruction stream. When output to the bus is initiated by execution
of a FORK, OSL, OSLF, OMA, OMAF, OMEM, MEMH, MEMF, MEMS or MEMM
microinstruction, the processor does not wait for the output to be accepted by
the bus. Only if another output microinstruction (including STOP) is executed
before the bus accepts a previous communication will the processor go into a
bus-wait state (STATUS = 1). Similarly, input data communications are buffered
until a WIN (wait for input) microinstruction is executed. If a second input is
received before a WIN is executed, the first communication is lost, and a fault
condition is noted.
The microprogram store
The microprogram store, called MICROMEM in the simulator, is a single
block of storage, 8 bits wide by 1024 words long, with a mechanism for trans-
mitting one word at a time to each of the 15 processors. The microprogram
location counters of the processors are connected to registers in the micro-
program store. Registers in the microprogram store connect back to the in-
struction registers in the processors. Whenever a nonzero value appears in
one of the input registers of the microprogram store, the value is used as an
address of a word to be fetched, and the word is transmitted to the corres-
ponding instruction register on the same cycle. The input register is then set
to zero. Any number of the processors may be serviced on each cycle.
The main store module
Each of the four main store modules incorporates a block of storage 16 bits
wide by 1024 words long. Whenever a request is received, the module cycles
through four states as described below. The address of a word to be accessed
is received with the request; for “store” and “read-modify-write”, a second
input, containing the data to be written (or rewritten), must be received before
the cycle can be completed.
There are four modes, modeled after the possible modes of access of a
typical core memory:
H: “Half-cycle read”      The contents of the word addressed are sent to the
                          requestor. The word contains zero afterwards.
F: “Fetch”                The contents of the word addressed are sent to the
                          requestor. The contents are rewritten in the word,
                          leaving it unmodified.
S: “Store”                When the data item to be stored is received, it is
                          stored in the word addressed.
M: “Read-modify-write”    The contents of the word addressed are sent to the
                          requestor. When the data item to be stored is
                          received, it is stored in the word addressed.
The algorithm for the main store module is as follows:

Phase 0 (wait for request)        Sleep until a request is received.
Phase 1 (address clock-in delay)  Delay.
Phase 2 (destructive read)        Delay; buffer ← MEMORY[address];
                                  if H, F, or M, output contents of buffer to
                                  requestor; MEMORY[address] ← 0;
                                  if H, go to Phase 0.
Phase 3 (rewrite)                 Delay; if S or M, wait until data word is
                                  received (in buffer);
                                  MEMORY[address] ← buffer; go to Phase 0.
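The following compact sketch renders this four-phase algorithm in Python as a
synchronous toy (the simulated module is asynchronous and event-driven, and the
delays are elided); receive_data is a hypothetical callback standing in for the
arrival of an OMEM word from the bus.

    def memory_cycle(memory, mode, address, receive_data):
        # mode is one of 'H', 'F', 'S', 'M' (see the access modes above)
        # Phase 1: address clock-in delay (elided)
        # Phase 2: destructive read
        buffer = memory[address]
        output = buffer if mode in ('H', 'F', 'M') else None
        memory[address] = 0
        if mode == 'H':
            return output              # half-cycle read leaves the word zero
        # Phase 3: rewrite
        if mode in ('S', 'M'):
            buffer = receive_data()    # wait for the word to be (re)written
        memory[address] = buffer
        return output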
There are four main store modules in the system, named MEM0, MEM1,
MEM2, and MEM3. The bus directs requests for memory access to one of these
four based on the low-order two bits of the address transmitted. In the algorithm
above, the main store modules use the upper 10 of the 12 address bits received.
The four modules thus form a 4096 word 4-way interleaved memory.
Each module maintains a first-in-first-out queue of all transmissions re-
ceived. Transmissions are always accepted; when the module returns to Phase 0,
it checks the queue before “going to sleep”.
The queue allows access requests and data to be transmitted asynchronously
from all the processors. Each queue element contains information identifying
the source processor; this allows data (for example for a store operation) to be
located in the queue even though the arrival of an access request from another
processor may have intervened between the data and the store request. The
queue also permits the “modified” data of a Read-modify-write access to arrive
at the memory before the original data has been sent to the requesting processor.
The Bus and Allocator
The bus is the sole connection between the processors and main memory and
among the processors themselves. Integral with the bus mechanism in the sim-
ulation is the allocator, which determines the assignment of “slave” processors,
and maintains information on the status of processors.
The bus itself transmits 32 bits. The usage of these bits is identified in
Figure 2-2 below.
S | D | C | M | B | <-------- data -------->
S = source code (4 bits)
D = destination code (4 bits)
C = bus mode (2 bits)
M = memory mode (2 bits) (significant only when D = 0 and C = 3)
B = memory bank (2 bits) (significant only when D = 0 and C = 1 or 3)
Data length is 16 bits.
FIG. 2-2. Z-machine bus bit assignments
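A bit-level rendering of these fields is given below. The widths come from
Fig. 2-2; the particular bit positions are our assumption, since the figure fixes
only the field order (the two bits left over out of 32 are unused here).

    def pack(S, D, C, M, B, data):
        # S(4) | D(4) | C(2) | M(2) | B(2) | data(16)
        return ((S << 26) | (D << 22) | (C << 20) | (M << 18) | (B << 16)
                | (data & 0xFFFF))

    def unpack(word):
        return ((word >> 26) & 0xF,    # S: source code
                (word >> 22) & 0xF,    # D: destination code
                (word >> 20) & 0x3,    # C: bus mode
                (word >> 18) & 0x3,    # M: memory mode
                (word >> 16) & 0x3,    # B: memory bank
                word & 0xFFFF)         # 16-bit data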
In general, the source and destination codes correspond to the processor
number, when nonzero; the allocator and memory are destination zero. Bus
modes 0 and 2 are associated with the allocator, and modes 1 and 3 deal with
transmissions among processors and memory. Figure 2-3 diagrams the eight
possible types of bus transaction.
                                    Bus mode (C)
Destination (D)    0               1              2              3
zero               FORK            data to        DETACH/STOP    address to
                                   memory                        memory
nonzero            startup or      data to        Reset (due     data with flag
                   acknowledge     processor      to detach)     to processor

FIG. 2-3. Z-machine bus transactions
The meaning of these transactions is as follows:
Mode 0 (FORK, startup, and acknowledge). When a FORK micro-
instruction is executed by a processor, the FORK message (con-
taining a startup address) is sent to the allocator. If the FORK
succeeds, the destination is transformed by the allocator into the
number of a free processor, and is transmitted as a startup message.
The next bus cycle is then pre-empted for an acknowledge message
to the FORK requestor. The source code of an acknowledge message
is the code of the newly allocated processor. If the FORK fails, an
acknowledge message is generated immediately, with a source code
of zero.
Mode 2 (DETACH or STOP). When a DETACH or a STOP micro-
instruction is executed by a processor, a mode 2 message is sent to
the allocator, destination zero. These messages are distinguished
by the contents of the message. If the data bits are all zero, it is a
STOP, requesting that the processor be deallocated. Otherwise, the
contents of the MASTER and SLAVE registers of the detaching
processor are contained in the data bits. These are verified against
the duplicate master-slave information in the allocator, then one or
two Reset messages are generated by the allocator and sent to the
master (if any) and slave (if any) of the detaching processor. The
reset messages contain new master or slave values for the processors
receiving the messages. The detaching processor is not deallocated,
but is, in this manner, removed from the chain it was in.
The allocator also checks the chain connections of a processor be-
fore allowing a STOP to cause deallocation. If a STOP comes from a
processor which has a master or a slave attachment or both, the
STOP is transformed into a DETACH and acted on as above, followed
by deallocation.
Mode 1 (data). A nonzero destination indicates a simple data trans-
mission to a processor (caused by execution of an OSL or OMA micro-
instruction). Zero destination indicates a data transmission to memory
(caused by an OMEM microinstruction). The memory bank number is
given in the B bits.
Mode 3 (Data with flag). A nonzero destination causes transmission
of data “with flag” to a processor (caused by an OSLF or OMAF
microinstruction). Zero destination causes a memory-cycle-start
message to be sent to memory. The memory mode is determined by
the M bits; an address is contained in the data portion of the trans-
mission. The memory bank number is determined by the B bits
(which equal the lower 2 bits of the address).
Status is maintained in the allocator in three arrays, named BUSY, MASTER,
and SLAVE. Each array has one element per processor. BUSY[i] is a single
bit telling whether processor i is allocated (= 1) or not (= 0). MASTER[i] and
SLAVE[i] are four bits each, giving the indices of processors logically con-
nected to processor i in a chain.
On cold start, BUSY[1] is set to 1, and a startup message is sent to pro-
cessor 1, with the startup address 000.
The Allocator
The allocator is the arbiter of both bus access and of processor assignment.
Its algorithm is described below.
Algorithm A (Allocation of bus cycles and management of processor activity)
Bidirectional connections exist with each of four memory modules (num-
bered 0 to 3) and fifteen processors (numbered 1 to 15). Each connection may re-
quest access to the bus at any time. A request may consist of a request for trans-
mission of a message to one of the other connections (requiring one bus cycle),
or it may be a request for one of the control actions of the allocator (requiring at
least one cycle).
Three tables are maintained by the allocator to keep track of the states and
connectivity between processors. BUSY[1:15] is an array whose elements have
a value of 1 or 0, indicating the non-availability or availability, respectively, of
the correspondingly numbered processors. MASTER[0:15] and SLAVE[0:15] are
arrays which maintain the connectivity between processors allocated via the FORK
command.
Initially, BUSY[i] = 0 for 1 ≤ i ≤ 15, and MASTER[0] = 0 (always).
On the first cycle, BUSY[1] ← 1, SLAVE[0] ← 1, and a startup message
is sent to processor 1.
1. (Check for memory module access request) Scan around the four
memory connections, beginning with the one following the one which
last was granted access, until a request is found or all four have been
checked. If a request is found, place it in COPY and go to Step 3.1.
2. (Check for processor access request) Scan around the fifteen pro-
cessor connections, beginning with the one following the one which
last was granted access, until a request is found or all fifteen have
been checked. If a request is found, place it in COPY. If it is a con-
trol request, let j=number of the source processor, and go to step 4.
3.1 (transmit a message) Transmit the contents of COPY to the requested
destination connection.
3.2 Wait for one bus cycle time. Go to step 1.
4. (perform control action) If the request is a STOP, then go to step 5;
If the request is a DETACH, then go to step 6. Otherwise, go to
step 7.
5. (do STOP) Set BUSY[j] ← 0, where j = source processor number.
If SLAVE[j] = MASTER[j] = 0, then wait for one bus cycle and go
to step 1.
6. (do DETACH) If MASTER[j] ≠ 0, then send a detach message to
processor MASTER[j] and wait for one bus cycle;
SLAVE[MASTER[j]] ← SLAVE[j];
If SLAVE[j] ≠ 0, then send a detach message to processor
SLAVE[j] and wait for one bus cycle;
MASTER[SLAVE[j]] ← MASTER[j];
MASTER[j] ← 0;
SLAVE[j] ← 0;
If no wait has yet been performed in this execution of step 6,
then wait for one bus cycle.
Go to step 1.
7. (FORK request)
If BUSY[i] = 1 for 1 ≤ i ≤ 15, then send failure message to
processor j, wait for one bus cycle, and go to step 1.
Set i = smallest integer such that BUSY[i] = 0;
BUSY[i] ← 1;
MASTER[i] ← j;
SLAVE[j] ← i;
Send a startup message to processor i, wait for one bus cycle,
and go to step 1.
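The control actions of Algorithm A (steps 4 through 7) can be restated in a few
lines of Python. The sketch below keeps the three status arrays but omits the
round-robin bus scan and all bus-cycle timing; send is a hypothetical callback
standing in for a bus transmission.

    NPROC = 15
    BUSY   = [0] * (NPROC + 1)    # BUSY[1..15]; index 0 is "processor zero"
    MASTER = [0] * (NPROC + 1)
    SLAVE  = [0] * (NPROC + 1)

    def fork(j, start_addr, send):
        free = [i for i in range(1, NPROC + 1) if BUSY[i] == 0]
        if not free:
            send(j, "acknowledge", 0)           # failure: source code zero
            return
        i = min(free)                           # smallest i with BUSY[i] = 0
        BUSY[i], MASTER[i], SLAVE[j] = 1, j, i
        send(i, "startup", start_addr)
        send(j, "acknowledge", i)               # source names the new slave

    def detach(j, send):
        if MASTER[j]:
            send(MASTER[j], "reset", SLAVE[j])  # new slave for the master
            SLAVE[MASTER[j]] = SLAVE[j]
        if SLAVE[j]:
            send(SLAVE[j], "reset", MASTER[j])  # new master for the slave
            MASTER[SLAVE[j]] = MASTER[j]
        MASTER[j] = SLAVE[j] = 0                # j is pulled out of the chain

    def stop(j, send):
        if MASTER[j] or SLAVE[j]:               # a STOP with attachments
            detach(j, send)                     # implies a DETACH first
        BUSY[j] = 0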
“The engineering of large software systems has become too complex for a single package design to have a reasonable chance of success upon implementation. ”
H. C. Rats, 1971
CHAPTER 3
DESCRIPTION OF EXAMPLE PROGRAMS
Introduction
The three programming examples described here were chosen (1) to
illustrate algorithms which make use of the available multiple asynchronous
microprocessors in novel and interesting ways, (2) to exercise the multiple
microprocessor system in ways which will demonstrate its capabilities and
limitations due to organization, and (3) to demonstrate implementations of plausible
“higher-level” machine-language instructions.
The first two examples perform in-place sorting of a table of N numbers.
Example 1 uses a “bubble” or “sifting” sort algorithm, in which up to N-1 pro-
cessors are connected in a chain through which the numbers are passed. This
algorithm avoids main memory references at the expense of increased inter-
processor communication.
Example 2 performs a sort using the Shell (or “diminishing increment”)
method; increments equal to 2^k are used so that data dependencies are predictable.
This algorithm is “faster” as N becomes large, but depends on memory references
to a greater extent than example 1.
The third example, matrix multiply, performs a relatively large amount of
numeric computation.
The emphasis throughout these examples is to arrive at techniques for con-
quering the computational burden by using as many processors as possible,
without depending upon the availability of any specific number of processors
(greater than 1).
Expected Run Time Variation
In each example, the expected computation time depends in different ways
upon the size of the problem (N) and the number of processors available (k).
The computation time, of course, depends on the speeds of the system com-
ponents. Below, we develop the expected run times in terms of number of
main memory accesses, since this is the limiting factor in examples 2 and 3.
For example 1, the number of interprocessor data transfers and the number
of FORK commands executed are also taken into consideration.
In example 1, the sifting sort, the sorting process is an “N^2” process -
that is, the number of comparisons to be performed varies as the square of
the number of elements in the table. In the implementation here, the time to
compare two elements is small compared to the time required to fetch an
element from main memory, but the number of fetches nonetheless increases
with N^2. The total loading of the interprocessor bus is expected to be a
limitation also, since each data element is passed on the bus an average of N/2
times. In addition, the number of forks attempted is shown since this number
rises with the size of the problem as long as the number of processors avail-
able is less than N-1.
The variation of these three items with N and k is shown below, based on
the actual implementation.
N = number of data elements in table
k = number of processors available
number of memory accesses:   M = N^2/k + N
number of bus accesses:      B = N^2/2
number of FORKs attempted:   F = (N^2/k + N)/2
The number of bus accesses, B, does not include memory accesses, although
the address and data are, in fact, transmitted on the bus. The two tables be-
low illustrate the variation of those numbers with N and k. In Table 3-1a, we
set N=32 and show how the numbers increase for decreasing numbers of pro-
cessors available.
TABLE 3-1a
EXAMPLE 1: VARIATION OF PROBLEM CHARACTERISTICS
WITH NUMBER OF PROCESSORS

                        k, number of processors available
N = 32                 32     16      8      4      2      1
memory accesses, M     64     96    160    288    544   1056
bus accesses, B       512    512    512    512    512    512
forks attempted, F     32     48     80    144    272    528
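The table values follow directly from the three formulas above; a few lines of
Python (illustrative only) reproduce them:

    N = 32
    for k in (32, 16, 8, 4, 2, 1):
        M = N * N // k + N    # memory accesses
        B = N * N // 2        # bus accesses
        F = M // 2            # forks attempted, (N^2/k + N)/2
        print(f"k={k:2d}  M={M:5d}  B={B:4d}  F={F:4d}")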
We see in the table above that for small numbers of processors, the memory
accesses become the dominant factor, particularly if we assume that memory
access time is 4 to 10 times the bus cycle time. The number of forks attempted
also increases, adding to the computation overhead as the number of processors
decreases. This is the penalty we pay for attempting to maximize the usage of
all available processors. In this system, the fork overhead is not large, but
in a system with a slower allocation mechanism, it could become significant.
The table below shows the variation with N for a constant number of processors
(k=16). The last line of the table shows, for comparison, the variation of
Nlog2N. Typically, a good sort algorithm on a single processor would run in
some small multiple of this number.
TABLE 3-1b
EXAMPLE 1: VARIATION OF PROBLEM
CHARACTERISTICS WITH NUMBER OF DATA ELEMENTS

                       N, number of data elements in table
                       16     32     64    128     256      512    1,024
memory accesses, M     32     96    320  1,152   4,352   16,896   66,560
bus accesses, B       128    512  2,048  8,192  32,768  131,072  524,288
N·log2 N               64    160    384    896   2,048    4,608   10,240
Here we see that while M is increasing with N^2/k, B varies with N^2/2. There-
fore, we would expect that the bus should be capable of handling the load as long
as the ratio of memory cycle time to bus cycle time is greater than half the
number of processors. (That is, if execution time is limited by either M or B,
the load is “balanced” when M·TM = B·TB; or TM/TB = (N^2/2)/(N^2/k) = k/2.)
Thus in this case, a cycle ratio of 8 should support up to 16 processors
without bus saturation. These numbers, of course, do not account for the
spreading of the processing load over time due to the asynchronous nature of
the processors and the memory. This effect would tend to leave the bus un-
saturated for slightly higher loads than indicated.
Example 2, the Shell sort, spends most of its time executing a series of
insertion sorts. The expected number of interchanges required to order m
numbers using an insertion sort is developed in Knuth [Knuth, D. E., “The art of
computer programming, Vol. 3: Sorting and searching”, Addison-Wesley (1973),
Section 5.2.1, pp. 84-95]. The improvement due to the Shell sort relies on
the partially-ordered properties of the table after each insertion sort: The
number of insertions expected on each pass is decreased because of the sorting
that has been performed already.
Only the worst-case sorting time is considered here; this at least gives an
estimate of the expected improvement with multiple processors.
Only the number of memory references required in the example 2 program
is considered here since this is the dominant factor. Some of these references
are “read-modify-write”. An insertion sort of m items requires at most
m^2 + 2m - 2 memory references.
In example 2, we perform
1 insertion sort of N items
2 insertion sorts of N/2 items
4 insertion sorts of N/4 items
. . .
2^(r-2) insertion sorts of 4 items

where N = 2^r. The sorts are performed on the smaller number of items first,
and we have chosen not to perform the two-item sorts. Assuming the number
of memory references is m^2 for a sort of m items, and summing over these
sorts, we obtain

M = 2^(r-2)·4^2 + 2^(r-3)·8^2 + . . . + 1·N^2
  = 2^(r+2) + 2^(r+3) + . . . + 2^(2r)
  ≈ 2·2^(2r) = 2N^2
But now, with the parallelism obtainable with multiple processors, we can, in
principle, perform all of the first 2^(r-2) insertion sorts at the same time since
they operate on different data. After these are completed, the next 2^(r-3) sorts
may be performed in parallel, and so on. (In fact, the example 2 program
allows each of these “pass 2” sorts to go ahead as its data are released from
the two insertion sorts in “pass 1” which operated on them.) Assuming that
at least N/4 processors are available, the execution time in terms of memory
references becomes

M = 4^2 + 8^2 + . . . + 2^(2r)
  = 2^(2r) (2^(4-2r) + 2^(6-2r) + . . . + 1)
  = 2^(2r) (1 + 2^(-2) + 2^(-4) + . . .) ≈ (4/3)N^2
Thus, the maximum effect of parallelism based on this simple (worst-case
insertion sort run time) model is a reduction of run time by 33% (4/3·N^2 rather
than 2·N^2 above). This reduction is so small because the dominant term in
each case is the last one (N^2), which is due to the insertion sort of N items,
which cannot be speeded by parallelism. The Shell sort would tend to be
better than this (and would be speeded more by parallelism) because the N
item insertion sort starts with data which are already partially arranged.
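The two bounds are easy to check numerically. The sketch below counts m^2
references per m-item insertion sort, as in the model above, and compares the
serial total with the per-pass maximum that bounds the parallel case; the two
ratios approach 2 and 4/3 as r grows.

    def costs(r):
        # passes sort 4, 8, ..., 2^r items; 2^(r-p) sorts of 2^p items per pass
        serial   = sum(2 ** (r - p) * (2 ** p) ** 2 for p in range(2, r + 1))
        parallel = sum((2 ** p) ** 2 for p in range(2, r + 1))
        return serial, parallel

    for r in (5, 8, 12):
        N = 2 ** r
        s, p = costs(r)
        print(f"N={N:5d}  serial/N^2={s / N**2:.3f}  parallel/N^2={p / N**2:.3f}")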
In example 3, the matrix multiply, we find that the number of memory
references is 5N^3 + 11N^2 + O(N) for the one-processor case, where N = number
of rows and columns of the square matrices. In the implementation of this
example, processors are initiated as fast as their data offset addresses can
be calculated, so that we expect the run time in terms of memory references
to be approximately 5N^3/k, where k = number of processors available.
However, memory references are not the limiting factor in this calcu-
lation. The number of microinstructions required to compute a product is
approximately 150 (fewer, in case of a zero operand or overflow). Thus, the
matrix multiply could require up to 150·N^3 microinstructions, plus the time
for data access. For N=6, this yields 150·6^3 = 32,400 microinstructions to be
executed. Compare this with M = 5·6^3 + 11·6^2 = 1476 memory references.
Even if a memory reference requires 10 times as much time as a micro-
instruction execution, the microinstruction time dominates by a factor of 2
(32,400 vs. 14,760 instruction-times). Asymptotically for increasing N, the
microinstruction execution time would dominate by a factor of 150N^3/(10·5N^3)
= 3. Thus, we have a program which is computation-bound. Its execution
time varies with N^3 and inversely with k, the number of processors available.
The computation overhead does not increase with increasing number of pro-
cessors, so we expect the run time to approach the 1/k curve typical of com-
putation-bound processes, as long as the computation is not competing with
other processes for resources.
Table 3-2 summarizes some of the empirical characteristics of these
three examples. Example 2 is the heaviest user of main memory, primarily
because some of the actively-used address counters are stored in a stack in
main memory. Example 3 uses main memory in bursts - whenever a sum
accumulation is done and new operand addresses are calculated. Example 1
makes very few references to main memory except when the number of avail-
able processors is very small. Processor utilization is light except for the
“compute-bound” Example 3.
Each example has a total run time which depends on the size of the table
or matrix. However, Example 2 is the only one in which the values operated
on have a large influence on the execution time; the Shell sort time can be sig-
nificantly lowered because the insertion sorts are shorter when the data are
“well-conditioned”. Example 3 is slightly data-dependent - the multiplication
terminates early when the multiplier or the multiplicand is zero.
The actual size of the microcode used to implement each of the examples
is shown. Examples 2 and 3 use, in addition, a stack in main memory. In
Example 2, with N = 100, 500 words are used; in Example 3, with N = 6, 113
words are used.
The maximum theoretical number of processors is the number which could
be used if allocation were limited only by the algorithm. Example 1 could use
31 processors to sort a 32-length table. Example 2 could, in principle, use
16 processors during the first part of a Shell sort on a 100-item table. (In
practice, the first eight processors terminate so quickly that only eight pro-
cessors were ever active at one time in this study. ) Example 3 uses one pro-
cessor on each inner product, using 36 of them for the 6 by 6 case here. (In
practice, a maximum of 13 processors were active at one time in this study. )
TABLE 3-2
SUMMARY OF PROBLEM CHARACTERISTICS

                             Example 1       Example 2      Example 3
                             Sifting Sort    Shell Sort     Matrix Multiply
                             N = 32          N = 100        N = 6

Main memory usage            light           heavy          moderate
                                                            (heavy bursts)
Processor usage              light           light          heavy
Run time is
dimension-dependent          yes             yes            yes
Run time is
data-dependent               no              yes            yes (multiply only)
Storage requirements:
  microcode                  162 bytes       306 bytes      341 bytes
  main memory stack          none            5N words       3N^2 + 5 words
Maximum theoretical
no. of processors            N-1 (31)        16             N^2 (36)
                                             (practical: 8)
Processor Allocation and Deallocation
At the start, there is a pool of up to fifteen processors available to the al-
locator. These processors are allocated by means of a FORK microinstruction.
An unallocated processor is in a STOP state, waiting for a startup message from
the allocator (in the bus). The startup message gives the address of a starting
microinstruction.
For cold start, a FORK 0 startup message is generated autonomously by the
allocator, causing the first processor to begin executing at location zero. At this
point, we would diagram the operation of the one processor as below.

 ___
  P1

The symbol at the top indicates that P1 has no “master” attached to it - it is
operating at the top level. If now P1 executes a FORK microinstruction which
is successful, the resulting diagram would be

 ___
  P1
   |
  P2
Here, the connecting line between P1 and P2 indicates that they are in a
master-slave relationship: to send a data word to P2, P1 would execute an “out-
put to slave” microinstruction; similarly, P2 would execute an “output to master”
to send a word the other direction.
If P2 now executed a FORK which is successful, the result is

 ___
  P1
   |
  P2
   |
  P3
We note that P2 can detect the presence of the slave, P3, by executing a
“test slave” microinstruction, which sets a condition code based on whether a
slave is attached or not. Similarly, a processor can test whether it is operating
at the top level or not by executing a “test master”.
To remove itself from a chain, a processor may execute a DETACH micro-
instruction. If P2 executes a DETACH while in the configuration above, the
result is

 ___        ___
  P1         P2
   |
  P3
Thus, we now have two active chains. Note that the master of P3 has become P1;
the chain has been “pulled up”. Normally, P2 would now execute a STOP micro-
instruction which would put it back in the pool of unallocated processors. This
is not required, however - it is possible for P2 to continue processing. But it
can no longer communicate directly with Pl or P3.
Note also that a fault condition results if P1 executes a FORK while a slave is
attached to it. A DETACH is permissible at any time. If a STOP is executed while
a processor is attached to either a master or a slave, the DETACH operation is
first automatically generated by the allocator.
Example 1: Sifting Sort
A set of numbers is stored in consecutive locations of main memory, starting
at location I (initial) and ending at location F (final). The numbers are to be re-
arranged into non-decreasing order.
The algorithm here creates a chain of processors, each of which “sifts” the
data being passed to it, always keeping in a local register the smallest data item
it has seen. The “top” processor fetches the data from the table and after comparing
the first two, passes the larger to the second processor. After fetching and com-
paring all of the data, the top processor puts the remaining item, which is the
smallest, into the first table position. The second processor has had all but that
one item passed to it, so the one remaining is the second smallest, and it places
it in the second table position. Similarly for the rest of the chain of processors,
except that the bottom processor stores the last two items. Thus, up to N-1 pro-
cessors may be used to sort a table of N items.
What happens when fewer than N-1 processors are available? Since each data
item being handed down the chain vacates a position in the table, the table itself
may be used to buffer the partial results; the bottom processor, when it is not the
(N-l)th one, stores the data it would otherwise pass on. After the last item has
been stored (and all the processors above it in the chain have terminated), the
bottom processor becomes the top processor and it restarts the sort on the remaining
data.
The basic algorithm is as follows:
Algorithm B (Bubble, or sifting, sort)
N data items in memory, stored at location I through F, are sorted in place.
If N-1 processors are available, each data item is fetched only once. If only
K < N-1 processors are available, then the first K items are sorted on the first
pass (i.e., fetching each of the N items once), etc.
“Get” means either fetching a data item from memory or receiving one from
the processor above in the chain (the “master”); “Put” means either storing a
data item in memory or sending it to the next processor below in the chain (the
“slave”).
B1.  (Initialize) Get an item → A.
B2.  (Receive next item) Get an item → B.
B3.  (Compare and exchange) If A > B, then exchange A and B.
B4.  (Test for (local) end) If no more items, go to step B8.
B5.  (Check on chain-building) If a slave is already attached, go to step B7.
B6.  (Add to chain) Request allocation of a slave (slave to start at step B1).
B7.  (Send item) Put B; go to step B2.
B8.  (Finish with A and B) Put B; store A into table.
B9.  (Test for chain position) If a slave is attached, terminate.
B10. (Test for (global) end) If I am processor N-1, return to calling program
     (done); otherwise, increment my number and go to step B1.
The top processor always fetches data from memory. A processor which has
requested a slave, but did not receive one, stores all of its data into memory.
To coordinate passing of data from a master to a slave, a token (a data word)
is passed from slave to master each time the slave is ready to receive a new
data item.
The master distinguishes the token from inputs from memory and from its
master by using the “test input” instruction. (See the detailed flowchart in
Figure B-6, Appendix B.) Thus, step B7 above may be delayed while waiting
for a token from the slave, and step B2 may include sending a token to a master
and waiting for the next item to be sent in response.
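The data flow of Algorithm B is captured by the short sequential sketch below
(Python, illustrative only): each element of the list chain plays the role of one
processor's local register A, with FORKs, tokens, and memory traffic abstracted
away.

    def sifting_sort(table):
        chain = []                    # chain[p] = smallest item seen by processor p
        for item in table:            # the top processor fetches from memory
            for p in range(len(chain)):
                if chain[p] > item:   # compare and exchange; pass the larger on
                    chain[p], item = item, chain[p]
            chain.append(item)        # allocate a new "processor" for the leftover
        return chain                  # processor p stores table position p

    print(sifting_sort([5, 1, 4, 2, 3]))   # -> [1, 2, 3, 4, 5]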
Example 2: Shell Sort
The Shell Sort algorithm is a faster method for large tables than the
sifting sort used in Example 1. Basically, this method performs repeated
“insertion” sorts on subsets of the items in the table, working on successively
larger sets. The method gains speed over the sifting method because the sets
in successive passes are partially sorted. This method is described in detail
and analyzed in Knuth [Knuth, D. E., op. cit.], where it is called a “diminishing
increment sort”. We have chosen increments equal to 2^k in order to make the
process neatly parallel and to make the data dependencies orderly.
We define two parameters - a pass number, P, and an offset number, R.
The pass number varies from 1 to log2N, where N is the smallest power of
two larger than or equal to the length of the table. On each pass, each element
of the table is involved in one insertion sort. There are N/2^P of these inser-
tion sorts. The offset number, R, designates these sorts, varying from N/2^P
down to 1. R is called the “offset” because it also designates where in the
table the first element of the insertion sort is to be found. The elements in an
insertion sort for pass P, offset R are to be found at locations R, R+N/2^P, . . .
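Enumerating these sorts for a small table makes the scheme concrete. The
fragment below (illustrative) lists every (P, R) insertion sort of the full scheme
and the table locations it touches for N = 8; the output corresponds row-for-row
to Figure 3-4 further on (the implemented program actually omits pass 1, as
noted at the end of this example).

    import math

    N = 8
    for P in range(1, int(math.log2(N)) + 1):
        inc = N >> P                             # increment N/2^P
        for R in range(inc, 0, -1):              # offsets N/2^P down to 1
            locs = list(range(R, N + 1, inc))    # R, R + N/2^P, ...
            print(f"pass {P}, offset {R}: locations {locs}")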
The simplified flowchart in Figure 3-3 describes the procedure used to im-
plement this Shell sort algorithm. The first processor begins at START in the
flowchart, and initializes its offset number, R, to N/2. Then, entering the
“pass” loop, it initializes the pass number, P, to 1. Next, “Lock TABLE(R)”
is executed, meaning that the first element to be used in the insertion sort is
removed from the table and a special value is replaced in it. This prevents
another pass from starting prematurely. In Figure 3-4, the table locations
which are at one time or another “locked” are shown with boxed X's. Next,
entering the insertion sort loop, the processor checks to see whether another
processor could be utilized on this pass. If R = 1, another is not needed because

FIG. 3-3. Shell sort - simplified flowchart. (Legend: R = offset number;
P = pass number; the step shown is “Lock TABLE(R)”.)

this processor is doing the last insertion sort of the pass. In this case, it
skips immediately to the insertion sort, shown as SORT(P, R).
Otherwise, the processor executes a FORK microinstruction, which re-
quests allocation of another processor. The double lines indicate potential
startup of a second program branch, with the “slave” processor beginning at
the SLAVE START node. If a slave was allocated (the “master” receives an
acknowledgment if it was), the “master” sends initialization parameters to the
“slave”: an offset number equal to one less than that of the “master”, plus
addressing information. The “master” then goes on to perform the insertion
sort for that pass.
Each physical processor keeps its offset value, R, constant for as long as
it is allocated. It does, however, go on to perform insertion sorts in subsequent
passes, if needed. Observe, for example, in Figure 3-4 that once the first
three insertion sorts are complete (pass and offset equal to (1,4), (1,3), and
(1,2)), the fifth one (2,2) can be performed without waiting for the fourth one
(1,l). In the flowchart, then, we see the processor which does the (P, R) in-
sertion sort incrementing P and checking to see if the (P+1, R) sort is one which
needs to be done. This is shown as the comparison of the current P to Pmax,
where Pmax is equal to N/2R. If the comparison indicates that another pass is
needed for this value of R, the processor stays in the insertion sort loop. If a
slave has not yet been allocated, the processor will again go through the test to
see if one is needed, and re-request one if necessary.
When no more passes are needed for this value of R, the processor
“unlocks” the table element at TABLE(R), and tests to see whether it is needed
for any further computation. If R = 1, the Shell sort is finished. If R > 1, there
is more to be done; if a “slave” has already been allocated, the “slave” will
take care of further allocations, and this processor returns itself to the pool
of free processors. In the case that no slave has been allocated, this (only
available) processor takes on the role of the slave by decrementing its offset
number, R, and going to the beginning of the “pass” loop.
Pass  Offset              Table Location              Physical Processor
  P      R       1    2    3    4    5    6    7    8       Number
  1      4                     [X]                 X           1
  1      3                [X]                 X                2
  1      2           [X]                 X                     2
  1      1      [X]                 X                          1
  2      2           [X]       X         X         X           2
  2      1      [X]       X         X         X                1
  3      1      [X]  X    X    X    X    X    X    X           1

Figure 3-4. Shell sort - implementation strategy, N=8
Figure 3-4 illustrates the implementation method for a table of length 8.
The pass number and offset are listed at left, and an X is shown under each
table location which is involved in that insertion sort. If this sort were per-
formed with a pool of only two processors, then the physical processor per-
forming the insertion sort would be the one whose number appears in the right
hand column. The execution with two processors would be as follows: Pro-
cessor number 1 initializes P to 1 and R to 4 and “locks” location 4. This
means that it stores a special value into location 4 as a flag to prevent a later
pass from using it before the insertion sort is finished. The contents of the
“locked” location are kept local to the processor during the insertion sort.
Next, processor 1 executes a FORK, and starts up processor 2 on the sort for
P=l, R=3. Processor 1 then proceeds with the insertion sort. Meanwhile,
processor 2 attempts a FORK, but fails to receive a slave. It then proceeds
with its sort. When processor 1 completes its sort, it discovers (in the P:Pmax
comparison) that no more insertion sorts at offset 4 are required, and it, there-
fore, “unlocks” location 4 and terminates. Processor 2 then finishes similarly,
except that when it finds no slave attached, it goes on to perform the sort for
P=1, R=2. It also succeeds with its FORK this time, starting processor 1 on
the sort for P=1, R=1. (Note that now processor 1 is the slave of processor 2.)
When processor 2 finishes this time, it goes on immediately to the P=2,
R=2 sort. Location 2 in the table remains “locked”. Similarly, processor 1
goes on to the sort for P=2, R=l . Next, processor 2 finishes and terminates.
Processor 1 goes on to pass 3, offset 1, the final insertion sort. Note that in
general, the final sort may have to wait until the P=2, R=1 sort is finished; a
potential wait occurs whenever a boxed X occurs below another one in the figure.
See Appendix B for a detailed flowchart of the implementation.
Since the overhead of setting up an insertion sort is considerable, we have
chosen to eliminate pass 1 (increment = N/2) in the implemented algorithm so
that the initial increment size is N/4.
Example 3: Matrix Multiply
This algorithm performs matrix multiplication on two square matrices.
The implementation here performs fixed-point multiplication using shift, add
and test microinstructions. Each processor works on one of the N^2 required
inner-product calculations, accumulating the sum in the result matrix. Figure
3-5 shows a simplified flowchart of the matrix multiply as implemented. In
order to claim as many processors as are available, the bottom processor, if
it is not the last one required, requests a slave after every multiplication and
sum.
FIG. 3-5. Matrix multiply - simplified flowchart.
Figures 3-6 and 3-7 show the contents of memory, and of some of the
processor registers, for an example, with N=2 (2 x 2 matrices). Matrix A is
stored at locations 300 through 303, in row order. Matrix B is at 400 and C,
the result, is at 500. The final stack contents are shown in Figure 3-6. The
first processor is presumed to contain the three matrix base addresses and
the length, and to have the stack base address. As each processor is allocated,
the master processor computes the next vector base addresses (shown as AADR
and BADR), the indices (I and J), and the result element address (CADR). It
then passes these values along with the length (LEN) to the slave. The slave
then builds a stack area for temporary storage and continues by attempting to
allocate another processor. The end is recognized by the last processor when
I = J = LEN. See Appendix B for a detailed flowchart and the program listing.
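The address arithmetic that each master performs for its slave can be stated
compactly. The sketch below (illustrative; the decimal address constants of
Fig. 3-6 are used literally) generates, for each of the N^2 inner products, the
values that appear in the processor registers of Fig. 3-7; the row-major
allocation order is inferred from those figures for N = 2.

    def assignments(abase, bbase, cbase, n):
        # yields (I, J, AADR, BADR, CADR) in allocation order
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                aadr = abase + (i - 1) * n       # base of row I of A (row order)
                badr = bbase + (j - 1)           # base of column J of B
                cadr = cbase + (i - 1) * n + (j - 1)
                yield i, j, aadr, badr, cadr

    for row in assignments(300, 400, 500, 2):
        print(row)   # (1, 1, 300, 400, 500), (1, 2, 300, 401, 501), ...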
Location   Name       Initial Contents   Final Contents   Comment
600        FLAG                          0 or -1          set to -1 if over-
                                                          flow occurs
601        IBASE                         2
602        JBASE                         2
603        ABASE      300                300
604        BBASE      400                400
605        AADR(1)                       300
606        BADR(1)                       400
607        STKBASE                       600
608        AADR(2)                       301
609        BADR(2)                       401
60A        STKBASE                       600
60B        AADR(3)                       302
60C        BADR(3)                       402
60D        STKBASE                       600
60E        AADR(4)                       303
60F        BADR(4)                       403
610        STKBASE                       600

FIG. 3-6. Stack contents for matrix multiply, N=2
                        Processor
Register Name    P1     P2     P3     P4
STK              605    608    60B    60E
CADR             500    501    502    503
LEN              2      2      2      2
I                1      1      2      2
J                1      2      1      2
AADR             300    300    302    302
BADR             400    401    400    401

FIG. 3-7. Register contents for matrix multiply, N=2
“But to get things done, you must love the doing. . . . I’ll be glad if people who need it find a better manner of living in a house I designed. But that’s not the motive of my work. Nor my reason. Nor my reward.”
Ayn Rand, The Fountainhead, 1943
CHAPTER 4
MEASUREMENTS AND RESULTS
Measurements
For the purpose of determining efficiency and relative loads, a large number
of measurements are taken on each simulation run and are printed on the summary
sheet. A sample summary sheet is shown in Figure 4-1. The following terminol-
ogy applies in the descriptions of these measurements:
LENGTH   means the run time of an example, expressed in number of
         processor cycle-times.
WIDTH    refers to the number of processors executing concurrently,
         at each cycle.
AREA     is the total number of processor cycles allocated to the
         example. It is the measure of the “processing” resource
         received by the example, and thus not available to any
         other user of the system. The area is computed by

                   LENGTH
            AREA =   Σ    WIDTH(t)
                    t=1
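Given a trace of WIDTH(t), the length, area, and average allocated parallelism
of a run follow immediately; the toy computation below (not the simulator's own
instrumentation) shows the relationship.

    def measures(width):          # width[t] = processors executing at cycle t
        length = len(width)       # run time in processor cycle-times
        area = sum(width)         # processor cycles allocated to the example
        return length, area, area / length

    print(measures([1, 2, 3, 3, 2, 1]))   # -> (6, 12, 2.0)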
In the simulation, most of the measurements are gathered in the bus module.
Raw counts of the length and area are made in terms of the bus cycle time. When
the bus cycle time and the cycle time of the processors are not the same, these
numbers are multiplied by the ratio, to render the length in units of processor
cycle-times. These are shown as “length” and “adjusted area” on the summary.
FIG. 4-1. Sample summary sheet.
The total number of microinstructions executed by all processors, shown as
“total minstr” on the summary, is the sum of actual instruction counts kept by
each processor module.
We now have the materials for the first important measure: how many of the
processor cycles allocated to the problem were used to execute microinstructions?
This efficiency measure is given as “true minst/adj area” in the summary. The
number of processor cycles spent in waiting is shown as “measured wait count”,
and is also displayed as a percentage of the area. Other measures gathered are
the following:
Average microinstruction parallelism: The total number of microinstructions
executed, divided by the run time, gives the average “width” of parallelism.
The maximum width is also shown, as well as a histogram of the width versus
the number of cycles that the number of processors was active.
Average allocated parallelism: The total number of processor cycles avail-
able to the program, divided by the run time, gives the average parallelism
as seen by the allocator. This is the measure of processor resource demand
by a task as it would compete with other tasks in the system. The ratio of
this average to the microinstruction parallelism average is the same as the
microinstruction/area efficiency measure.
Bus traffic : The total number of bus communications, including control,
inter-processor, and processor-main memory transmissions, is tallied here.
Also displayed are the traffic per microinstruction, the traffic per run time,
and the percentage of total bus bandwidth used.
Memory references: The total number of main memory cycles initiated is
tallied, including fetch, half-read, store, and read-modify-write cycles. Al-
so displayed are the number of cycles per microinstruction, cycles per run
time, and percentage of maximum memory bandwidth used.
Memory mode count shows the count of each type of memory reference.
Memory bank count shows the number of references to each of the four main
memory modules.
Successful FORKs is the number of times the allocator provided a slave
processor to the task (including the cold start). FORKs attempted is the
number of times a slave was requested. The number of successful FORKs
is often characteristic of the problem: In Example 1 (sifting sort), it is
always equal to the table length minus one; in Example 3 (matrix multiply),
it is equal to the number of inner products to be performed (2).
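The relationships among these measures can be stated compactly. The following is a minimal sketch (the per-cycle width trace and microinstruction count below are made-up numbers for illustration, not simulator output) of how LENGTH, AREA, the two parallelism averages, and the efficiency measure fit together:

    # Illustration of the measures defined above. 'width' gives, for each
    # processor cycle of the run, the number of processors allocated to the example.
    width = [1, 2, 3, 4, 4, 4, 3, 2, 2, 1]     # made-up per-cycle allocation
    microinstructions = 20                      # made-up "total minstr"

    length = len(width)                         # run time, in processor cycle-times
    area = sum(width)                           # processor cycles allocated
    allocated_parallelism = area / length       # average parallelism seen by the allocator
    micro_parallelism = microinstructions / length
    efficiency = microinstructions / area       # the "minstr/adj area" measure

    # The ratio of the two parallelism averages equals the efficiency measure.
    assert abs(micro_parallelism / allocated_parallelism - efficiency) < 1e-12
    print(length, area, allocated_parallelism, micro_parallelism, efficiency)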
Using the measures just described, we answer the following questions in this
chapter.
1. How frequently, on the average, does a microprogram refer to main
memory? How frequently does it require bus communication? Are
these frequencies characteristic of the microprogram?
2. What bus capacity (bandwidth) is necessary to achieve adequate efficiency
for each example? Is there a value which satisfies all examples?
3. How important is main memory cycle time? Are there any examples
which are unusually dependent on memory speed?
4. How many processors can be used by each of these examples? What
limits the amount of parallelism achieved?
5. How much does run time decrease when increasing numbers of processors
are made available?
6. Assuming that the cost of a system goes up linearly with the number of
processors in it, what number of processors minimizes the cost per
throughput for these examples?
The three examples were deliberately chosen to have different characteristic
resource demands. Example I, the sifting sort, has relatively few references to
main memory, but uses many bus cycles for inter-processor communication.
Example 2, the Shell sort, depends heavily on main memory access, with rel-
atively little other bus communication. Example 3, matrix multiply, is the only
example to perform a large amount of computation.
Results
1. Bus and Memory Utilization
A not-too-surprising result, suggested by question 1 above, is that the aver-
age number of bus requests and the average number of main memory accesses
per microinstruction executed remains nearly constant for each example, over
a wide range of running conditions (see Table 4-2). The change of memory usage
with varying availability of processors in Example 1 shows clearly in this
measure, while in the other two examples it is clear that serialization of the com-
putation with few processors does not affect the average occurrence of memory
or bus accesses per microinstruction.
TABLE 4-2
AVERAGE UTILIZATION OF BUS AND MAIN MEMORY

                                Main Memory References      Bus Accesses
                                Per Microinstruction        Per Microinstruction
Example 1 - 15 processors               0.5%                       7%
Example 1 - 1 processor                 5%                        11%
Example 2                              10%                        20%
Example 3                               4%                        11%

NOTE: Bus accesses include address and data transmissions to main memory
and data transmissions to processors from main memory. "Main memory
references" is the number of fetch and store cycles initiated.
In the Z-machine, all memory access takes place by communication across
the common bus. Wherever memory access time is referred to in the measure-
ments to follow, the time includes two bus cycles (one for address transfer, the
second for data transfer). Memory rewrite time (when simulated, as for core
storage) is not included.
2. Bus Capacity
The general result on bus capacity, in answer to question 2 above, is that
for all three examples, a bus cycle time equal to the processor cycle time is
adequate to maintain good efficiency (microinstructions/area). In particular,
when the memory cycle time is not more than four times the processor cycle
time, a bus cycle time equal to the processor cycle time yields efficiency in all
examples of 74% or better. This efficiency is reflected correspondingly in the
run time.
Tables 4-3, 4-4, and 4-5 show the efficiency results of running the three
examples with varying bus and memory speeds. All speeds are expressed as
a ratio to processor cycle time. The results of these variations in run time
are presented graphically in Figures 4-6, 4-7, and 4-8. In the latter three
figures, the same result data are presented in two ways - run time versus mem-
ory speed for constant bus speed, and run time versus bus speed for constant
memory speed.
These results suggest that it is quite reasonable to operate with the bus
"saturated", i.e., operating near capacity. When the processors are occasion-
ally waiting for access to the bus, it indicates that the bus is being utilized
fully. On the other hand, the sharpness of run time variation with bus speed
indicates that bus capacity is a crucial resource.
TABLE 4-3
EXAMPLE 1: SIFTING SORT - EXECUTION EFFICIENCY
"EFFICIENCY" = % microinstructions/area

                    Bus Speed (fast to slow: 1/4, 1/2, 1, 2, 4)
Memory Speed
  fast    1        79
          2        79   78   76   64
          4        79   76
          7        79   78   76
  slow   10        36

Note: 15 processors used in all cases.
TABLE 4-4
EXAMPLE 2: SHELL SORT - EXECUTION EFFICIENCY
"EFFICIENCY" = % microinstructions/area

                    Bus Speed (fast to slow: 1/4, 1/2, 1, 2, 4)
Memory Speed
  fast    1        97
          2        91   91   80
          4        80   79   74   50
          7        66   65   50
  slow   10        55   56   48   28

Note: 8 processors used in all cases.

TABLE 4-5
EXAMPLE 3: MATRIX MULTIPLY - EXECUTION EFFICIENCY
"EFFICIENCY" = % microinstructions/area

                    Bus Speed (fast to slow: 1/4, 1/2, 1, 2, 4)
Memory Speed
  fast    1        98
          2        94   93   84
          4        86   85   81
          7        75   74   62
  slow   10        66   66   60   37

Note: number of processors used varied from 7 to 13.
FIG. 4-6. Example 1: sifting sort - bus speed dependence. (Run time versus bus cycle time, for several memory speeds; 15 processors used in all cases.)
FIG. 4-7. Example 2: Shell sort - bus and memory speed dependence. (Run time versus bus cycle time and versus memory cycle time; 8 processors used in all cases.)
FIG. 4-8. Example 3: matrix multiply - bus and memory speed dependence. (Run time versus bus cycle time and versus memory cycle time; number of processors used varied from 7 to 13.)
3. Main Memory Speed
We find that memory cycle time is quite important for efficiency in Examples
2 and 3 (but not in Example 1, where memory access was deliberately avoided).
A memory cycle time of seven times the processor cycle time yielded an efficiency
of 65% or better in all examples with bus cycle ratio of 1 or less. However, the
effect on run time is less dramatic. Example 1 (Figure 4-6) did not show any
significant variation of run time with memory speed. The sensitivity of run time
to memory speed is greatest, as may have been expected, for Example 2. Even
including this example, however, the memory speed dependence shows no sharp
features, and we conclude that the Z-machine organization has not introduced any
unusual sensitivity to main memory speed.
4. Amount of Parallelism
Table 4-9 shows typical parallelism and execution time results for the
three examples. Examples 1 and 3 were able to employ 8 to 9 processors on
the average, when 15 were available. Example 2 used less than three on the
average. We attribute this to the nature of the Shell sort algorithm as imple-
mented here - the last two passes in the sort can use only two and one pro-
cessors, respectively.
TABLE 4-9
TYPICAL PARALLELISM AND EXECUTION TIME

                                   Example 1-32      Example 2-100     Example 3-6
                                   Sifting Sort      Shell Sort        Matrix Multiply
                                   Bus = 1           Bus = 1           Bus = 1
                                   Memory = 2.75     Memory = 4        Memory = 7

Typical Average Parallelism
  "Allocated"                        8.8 (15)          2.7 (8)           7.8 (11)
  Microinstruction                   6.7               2.0               5.8

Typical Achieved Execution Time
  with maximum number
  of processors                      3,100 (15)       14,000 (8)         5,900 (11)
  with 1 processor                  24,300            32,900            45,000
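The speedups implied by Table 4-9 can be read off directly; the following minimal sketch simply divides the one-processor run time by the run time achieved with the maximum number of processors (values copied from the table):

    # Speedups implied by Table 4-9.
    runs = {
        "Example 1-32 (sifting sort)":   (24300, 3100, 15),
        "Example 2-100 (Shell sort)":    (32900, 14000, 8),
        "Example 3-6 (matrix multiply)": (45000, 5900, 11),
    }
    for name, (t1, tk, k) in runs.items():
        print(f"{name}: speedup {t1 / tk:.1f} with {k} processors")

The resulting speedups, roughly 7.8, 2.4, and 7.6, track the average allocated parallelism figures of 8.8, 2.7, and 7.8 in the table.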
Figures 4-10 through 4-15 present graphically the amount of parallelism
achieved on the average for the three examples. For each example, two sets
of results are shown - one for fast memory and bus speeds and the other for
slow memory and bus. By comparing the two, one can see the extent to which
the hardware parameters affect the parallelism. Since adequate bus capacity
has turned out to be crucial to effective use of the system, we have shown the
bus bandwidth utilization versus number of processors on the same scale in each
figure. Observe the correlation of bus saturation with a dramatic increase in
wait cycles. Also, observe that for Example 3 (Figures 4-14 and 4-15) the
maximum number of processors in operation at one time decreases from 12 to
7 as the bus and memory speeds are reduced.
The number of processors which can be usefully employed in each example
is dependent both on the hardware parameters and on the coding (the algorithm
implementation). The primary limitation of the Z-machine, in this respect, is
that only one processor can be allocated at a time. Furthermore, typically
each processor must then be loaded with enough state information to determine
whether another processor should be attached. For this reason, even the most
independently computable tasks, as in Example 3 (matrix multiply), cannot all
be started up at once; Example 3 was, therefore, unable ever to use more than
13 processors at a time while computing 36 independent inner products. On
the other hand, this skewing of the computation, causing a corresponding
leveling of demand for processors over time, would be an asset in a multi-
programmed system. (Further remarks on multiprogramming are made in
Chapter 5).
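The effect of this one-at-a-time allocation can be illustrated with a toy model. The startup and task lengths below are assumptions chosen only to show the mechanism, not measurements from the simulator:

    def peak_concurrency(n_tasks, n_processors, startup_cycles, task_cycles):
        # Tasks are handed out one FORK at a time, so task i cannot begin before
        # i * startup_cycles; each task then runs for task_cycles.
        n_started = min(n_tasks, n_processors)
        starts = [i * startup_cycles for i in range(n_started)]
        finishes = [s + task_cycles for s in starts]
        events = sorted([(t, +1) for t in starts] + [(t, -1) for t in finishes])
        active = peak = 0
        for _, delta in events:
            active += delta
            peak = max(peak, active)
        return peak

    # 36 inner products, 15 processors, an assumed 40 cycles of startup per FORK,
    # and an assumed 300 cycles per inner product: the peak never reaches all 15.
    print(peak_concurrency(36, 15, 40, 300))

Whenever the startup cost per FORK is a nontrivial fraction of the task length, the later processors start only as the earlier ones are finishing, which both caps the peak and levels the demand over time.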
FIG. 4-10. Example 1 - parallelism with fast bus and memory. (Average parallelism and bus bandwidth utilization versus number of processors.)
FIG. 4-11. Example 1 - parallelism with slow bus and memory. (Average parallelism and bus bandwidth utilization versus number of processors.)
FIG. 4-12. Example 2 - parallelism with fast bus and memory. (Shell sort, memory = 2 (fast); average parallelism and bus bandwidth utilization versus number of processors.)
FIG. 4-13. Example 2 - parallelism with slow bus and memory. (Shell sort; average parallelism and bus bandwidth utilization versus number of processors.)
FIG. 4-14. Example 3 - parallelism with medium bus and fast memory. (Matrix multiply, bus = 1 (medium), memory = 2.125 (fast); average parallelism and bus bandwidth utilization versus number of processors.)
FIG. 4-15. Example 3 - parallelism with slow bus and memory. (Matrix multiply; average parallelism and bus bandwidth utilization versus number of processors.)
5. Run Time Versus Number of Processors
Figures 4-16, 4-17, and 4-18 show how the run time of the three examples
varies with the number of processors available in the system. (For each example,
results are shown for the same pair of fast and slow bus and memory speeds as in
the previous six figures.) In each case, the run time for one processor is taken
as 1.00, and the rest of the run times are normalized so that the shapes of the
curves are commensurate.
We see again in Figure 4-16 that the characteristics of Example 1 are
relatively independent of machine parameters, and in Figure 4-17 that Example 2
is quite sensitive to them. It is also again clear here that Example 2 does not
make good use of more than 2 processors. We note also the apparent anomaly of
the 3-processor case in Figure 4-17, which is due to the requirement of the
Example 2 algorithm for four processors on the third-to-last pass, then two pro-
cessors on the next-to-last pass, and finally one processor on the last pass. The
third processor does not, by itself, add nearly as much to the computation as do
a third and fourth together.
Finally, we see the smooth variation of run time for Example 3 in Figure
4-18, but with variation with bus and memory speed of the maximum number of
useful processors.
The value of the decrease in run time with increasing numbers of processors,
expressed as cost per throughput, is investigated below.
FIG. 4-16. Example 1 - relative run time versus number of processors.

FIG. 4-17. Example 2 - relative run time versus number of processors.

FIG. 4-18. Example 3 - relative run time versus number of processors.

6. Cost-performance Analysis
Below, we develop a simple cost model for a multiple-identical-processor
computing system and analyze the cost per throughput to determine the optimum
number of processors for each example. In this model, the maximum number
of processors available in the system is regarded as the cost basis (the single-
task assumption). This still leaves open the possibility of further cost reduction
by multiprogramming.
We assume that the processors, being identical, all cost the same amount,
p. The cost of other system components - bus, memory, peripherals, etc. - is
to be absorbed in the cost, S, of the system without processors. The object of
the following analysis is to answer the question: Given that I intend to purchase
a computer system of the type described, which can accommodate a number of
processors ranging from 1 to a maximum, say Kmax, how many processors
should I buy? This question is answered here with figures for system cost divid-
ed by throughput, where throughput is naively defined as the inverse of job run
time.
    C = cost of system = S + k*p

    where  S = cost of non-processor components
           p = cost of a processor
           k = number of processors in system

    Normalized cost = C/S = 1 + (p/S)*k

    Run time with k processors = Tk
    Normalized run time = Tk/T1
    Normalized throughput = 1/(normalized run time) = T1/Tk

    Mk = (cost/throughput) with k processors = (1 + f*k) * (Tk/T1)

    where  f = p/S = processor cost / non-processor components cost

    M1 = cost/throughput with 1 processor = 1 + f

    Normalized cost/throughput = Mk/M1 = ((1 + f*k)/(1 + f)) * (Tk/T1)          (1)
For curves of constant cost (letting f vary), we set Mk/M1 = 1, which yields

    Tk/T1 = (1 + f)/(1 + k*f)                                                    (2)
Equation (2) above gives a family of constant-cost curves, which are shown
in Figure 4-19 for several values of f. This family of curves allows us to answer
the question: If a job runs with given (unit) speed with a single processor, how
fast does it have to run with k processors in order to justify purchasing those k
processors? The answer is given by locating on Figure 4-19 the point corres-
ponding to the ratio of run times and the number of processors. If the point lies
below the constant-cost curve corresponding to the marginal cost of a processor,
then the purchase decision would be made.
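As a concrete illustration of how equations (1) and (2) are used, the following minimal sketch evaluates the normalized cost per throughput for each candidate number of processors, applies the constant-cost test of equation (2), and reports the minimizing k. The run-time list and the value of f are placeholders, not measured data.

    # Normalized cost/throughput, equation (1): Mk/M1 = ((1 + f*k)/(1 + f)) * (Tk/T1).
    # 'run_times[k]' is the run time with k processors, normalized so that
    # run_times[1] is the one-processor run time. The values are placeholders.
    def cost_per_throughput(run_times, f):
        t1 = run_times[1]
        return {k: ((1 + f * k) / (1 + f)) * (tk / t1) for k, tk in run_times.items()}

    def justified(k, tk_over_t1, f):
        # Constant-cost test, equation (2): purchase k processors only if the job
        # runs fast enough that Tk/T1 falls below (1 + f)/(1 + k*f).
        return tk_over_t1 < (1 + f) / (1 + k * f)

    run_times = {1: 1.00, 2: 0.55, 4: 0.32, 8: 0.21, 15: 0.17}   # placeholder values
    f = 0.1                                                       # processor cost / S
    m = cost_per_throughput(run_times, f)
    best_k = min(m, key=m.get)
    print(m, "minimum at k =", best_k)
    print("k = 15 justified?", justified(15, run_times[15], f))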
Figures 4-20, 4-21, and 4-22 show the cost per throughput versus number of
processors for three values of assumed processor cost. The minimum cost points
are summarized below in Table 4-23.
TABLE 4-23
NUMBER OF PROCESSORS FOR MINIMUM COST/THROUGHPUT

Processor Cost Assumption      Example 1      Example 2      Example 3
        f = 0.025                  15              8             11
        f = 0.1                    15              4             11
        f = 0.25                    8              2              8
These results are for the "typical" settings of bus speed = 1 and a moderate
memory speed. We conclude that if the cost of a processor remains "small" - even
as much as 25% of the non-processor components cost - then these examples have
achieved economical usage of multiple processors.
-lOl-
u-l -
cu In
- d y
0 o
e-8
0
-
- d II - - co
* N
e N
0
co co
e N
0
- -
- d
d 6
d
3Wll
Nlltl
3AIlVl3~ x
IS03 W
31SAS
-103-
FIG. 4-21. Example 3 - cost curves. (System cost x relative run time versus number of processors.)
FIG. 4-22. Cost curves: system cost x relative run time versus number of processors.
"The prediction of future progress requires technological sophistication, some 'inside' information about the research currently in progress, and a large amount of courage."
William F. Sharpe, The Economics of Computers, 1969
CHAPTER 5
UNEXPECTED RESULTS AND POSSIBLE EXTENSIONS
Results From Execution Trace
In addition to providing a direct and effective means of detecting and cor-
recting errors in machine design and microprograms, the execution trace output
leads to observation of unexpected system behavior that would remain undetected
in an analytical or less detailed model.
Figures 5-1 and 5-2 are samples of the execution trace generated by the
Z-machine simulation code. (This trace output is constructed by user-written
simulation code.) Figure 5-1 shows the beginning of execution of a run of Example
1, the sifting sort. Note the data bus transactions - first the cold start FORK and
its acknowledge cycle, then the "half-read" cycle initiation transmission to
memory location 0000, and so on. In the right half of the figure, which is the
continuation, the first processor-generated FORK occurs, and we see processor
2 start up at microstore location 009 (underlined).
In Figure 5-2, a section of the trace of Example 3, matrix multiply, is shown.
At the top of this figure, the execution of seven of the active processors is traced.
Processor 10 (hex 'A') is also active, but not shown. Processor 1 has finished
one inner product and has stopped; in the middle of the figure, we see it being
attached again as "slave" to processor 10 (boxed). Note also the heavy but not
completely saturated usage of the bus.
FIG. 5-1. Sample of execution trace from Example 1.
FIG. 5-2. Sample of execution trace from Example 3.
Figure 5-3 shows a portion of the trace from Example 2, the Shell sort. In
this run, the bus cycle time is four times the processor cycle time, and the bus
is completely saturated. As a result, we see many waits for both bus and
memory access. (Due to the cycle ratio, only one of each four microinstructions
is printed.) In this trace, we discovered a phenomenon which had not been an-
ticipated. The peculiarly regular pattern of memory cycling is an undesirable
effect. The columns of “RD” here represent cycles during which a memory bank
has accepted a store request but has not yet received the data to be stored. The
regular pattern results from the coincidence of the four memory requests and
from the nature of the bus priority mechanism. Bus access is given to processors
on a cyclic basis: the processor which is granted access on one cycle has the lowest
priority for access on the next cycle. We see here that this causes the store and
data transmissions from any one processor to be separated whenever another
processor is waiting for access. Serious deterioration of memory utilization
could have resulted, in the Figure 5-3 trace, if more than these four processors
had been requesting bus access (not necessarily to memory). A solution would be
to promote the bus access priority, for one transmission, of a processor which
has transmitted a memory store or modify command.
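To make the arbitration behavior concrete, here is a minimal sketch of the cyclic priority rule together with the proposed modification. It is only an illustration of the policy as described, not the simulator's code; the request queues in the example are invented.

    # Cyclic ("round-robin") bus arbitration: the processor granted the bus on one
    # cycle becomes the lowest-priority requester on the next. 'promote_stores'
    # adds the proposed fix: a processor that has just transmitted a store or
    # modify command keeps top priority for its following data transmission.
    def arbitrate(pending, promote_stores=False):
        # 'pending' maps processor number -> list of transmissions still to send,
        # e.g. {0: ["store", "data"], 1: ["store", "data"]}.
        pending = {p: list(q) for p, q in pending.items()}
        order = sorted(pending)            # current priority order, highest first
        promoted = None                    # processor owed an immediate data cycle
        grants = []
        while any(pending.values()):
            requesters = [p for p in order if pending[p]]
            winner = promoted if (promote_stores and promoted in requesters) else requesters[0]
            kind = pending[winner].pop(0)
            grants.append((winner, kind))
            promoted = winner if kind in ("store", "modify") else None
            order.remove(winner)
            order.append(winner)           # granted processor drops to lowest priority
        return grants

    # Two processors both storing: without promotion their command and data cycles
    # interleave; with promotion each store's data follows its command at once.
    print(arbitrate({0: ["store", "data"], 1: ["store", "data"]}))
    print(arbitrate({0: ["store", "data"], 1: ["store", "data"]}, promote_stores=True))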
Processor-to-bus Buffering
The processor has a buffer register in which output to the bus is held while
waiting for access. How much time is gained by allowing the microprogram to
proceed while waiting for bus access? We have rerun Example 3, matrix multiply
(bus = 1, memory = 7), with this output buffering disabled. The results are com-
pared in Table 5-4 below.
FIG. 5-3. Sample of execution trace from Example 2.
TABLE 5-4
COMPARISON OF RUNS WITH AND WITHOUT BUS BUFFER (EXAMPLE 3)
(Run time versus number of processors.)
The change in run time indicated that bus output buffering is not crucial under
the run conditions of this example.
Program “overhead”
After the three examples here were coded, the following question was raised:
How many of the microinstructions executed would be unnecessary if the problem
were being run on a single processor? In other words, how much program over-
head has been introduced by coding the algorithms for multiple processors, with
all of the necessary coordination? To answer this question, we examined the
microcode and counted the microinstructions which are devoted to coordination
(or would be unnecessary if we knew that only one processor were available) and
multiplied by the number of times that these microinstructions are executed in
the particular example cases run. The results, expressed as a percentage of the
total number of microinstructions, are shown below in Table 5-5.
TABLE 5-5

                                     Example 1        Example 2        Example 3
                                     Sifting Sort     Shell Sort       Matrix Multiply
"Overhead" in One-Processor
Case (Microinstr.)                      51%              3.6%             11%
Example 1 overhead seems unexpectedly high until we consider that the algo-
rithm, the sifting sort through a chain of processors, was designed for a many-
processor situation; the algorithm is intended to reduce memory references by
using large numbers of processors. The one-processor case here reduces to the
traditional bubble sort. The microprogram for Example 1, furthermore, is highly
“folded”- there is only one main loop for the three states possible for a processor.
This loop, therefore, has many status tests for alternate subpaths. All of these
tests would become unnecessary in a one-processor case.
Example 2 overhead is quite low, reflecting the relatively small amount of
parallelism in the implementation of the algorithm. Example 3 overhead seems
reasonable. The possible overhead was, in fact, taken into consideration when
the decision was made to attempt a FORK after each multiply and sum step in the
inner product loop. (See the simplified flowchart, Fig. 3-4, and discussion in
Chapter 3). We conclude that the microprogrammer in general has control over
the coordination overhead in that he can trade off some of the many-processor
parallelism for efficiency with one processor.
Data Bus Design
We have seen that the common bus used in the Z-machine is a crucial resource.
It appears that the bus is the limiting factor in determining how many processors
can be adequately supported in the hardware organization. Several alternatives
are proposed in the “design alternatives” section below.
For large numbers of processors, say 100 to 1000, it is clear that some com-
pletely different approach is required. One direction of current research
(M. L. Schlumberger, private communication) into a balanced scheme of inter-
processor communication has suggested the following. Each processor would be
located at a node which has two incoming and two outgoing paths. A proposed inter-
connection scheme then guarantees that a message from any one node can reach any
other node in time log2N, where N is the number of nodes. The scheme includes a
straightforward address-transformation method of routing the messages. In ad-
dition, there are multiple paths between nodes, so that a failure in one link only
degrades the transmission time of some messages.
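One well-known interconnection with the properties described (two incoming and two outgoing paths per node, and a simple address-transformation routing that reaches any node in log2 N steps) is a de Bruijn-style shuffle network. The sketch below is only an illustration of that general idea; it is not the particular scheme referred to above, and the node count is arbitrary.

    def neighbors(node, n_bits):
        # Node x is linked to nodes (2x) mod N and (2x + 1) mod N, where N = 2**n_bits,
        # giving every node exactly two outgoing and two incoming links.
        mask = (1 << n_bits) - 1
        return ((node << 1) & mask, ((node << 1) | 1) & mask)

    def route(src, dst, n_bits):
        # Shift the destination address in, one bit per hop: at most log2(N) hops.
        mask = (1 << n_bits) - 1
        path = [src]
        node = src
        for i in range(n_bits - 1, -1, -1):
            bit = (dst >> i) & 1
            node = ((node << 1) | bit) & mask
            path.append(node)
        return path

    # With N = 16 nodes (4 address bits), any message arrives within 4 hops.
    print(route(0b0110, 0b1011, 4))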
Such a scheme obviates two of the major drawbacks of the common bus in the
Z-machine. The whole system is vulnerable to failure in the bus; and the common
bus approach will always impose a limit on the number of processors which can be
attached.
Hardware Design Alternatives
The following suggestions cover alternatives which were considered-and re-
jected for this study, and some further possibilities which should be investigated
in the development of a viable complete system.
1. As long as there are as few as fifteen processors in the system, a
crossbar switch may be economically justifiable for processor-to-
processor and processor-to-memory communication. The cost, however,
rises sharply with larger numbers of processors.
2. An obvious extension of bus capacity would be provided by using a
separate bus for processor-to-memory transactions.
3. The Z-machine processor microinstruction set does not contain any
microinstructions which are particularly helpful in decoding and
interpreting the “machine-language” instructions. For example, an
efficient means of performing a many-way branch would assist in
writing emulation routines. Also, instructions which use masked
portions of a register as an index would be helpful.
4. In coding the examples here, all variables which could not be held
in processor registers were stored in main memory. By using read-
write microprogram storage and adding a “store local” microinstruction,
we could use this “local” storage for such temporary variables. By
introducing writable microprogram store, we open up many problems
of microprogram protection, but these might be worth facing in return
for a possible speed advantage.
Software Alternatives and Extensions
The following ideas relate to extending the parallelism we have achieved to
higher programming "levels".
1. The concept of multiprogramming in the conventional sense remains
valid in the context of the Z-machine organization. Competition for
processors at the lowest level, while it may not be under the control
of an operating-system-level scheduler, should not prevent successful
utilization of otherwise idle resources. In the economic analysis in
Chapter 4, the cost of all processors available, even when unused,
was charged to the "user" program. Multiprogramming could, there-
fore, improve the resulting cost figures by recovering some of the un-
used cycles. As a direct extension of the work presented here, we
would be interested to run two or more of the microprograms con-
currently, observe the competition for resources, and watch for
potential dominance of one microprogram over another.
2. No provision in this model has been made for input/output processing.
It should be possible, by a straightforward extension of the Z-machine micro-
processor, to create an I/O processor having communication
paths to external devices. This and other approaches to I/O handling
should be explored.
3. A serious challenge in developing a viable multiprocessor system is to
have a well integrated operating system. We have briefly in the past
(Levy, J. V., "Modules for Operating Systems and Computer Organization",
Computation Group Technical Memo, Stanford Linear Accelerator Center,
August 19, 1970) worked on developing some of the primitive operations
which would handle the queuing tasks of an operating system. Further
development of such operations in the context of a multiple microprocessor
system would be of value.
4. Once the interpretation cycle for some machine language has been
written in Z-machine microcode, there are extensions that can be
considered. One intriguing possibility is to introduce some (machine-
language) instruction overlap by using more than one processor in the
fetch-decode portion of the fetch-decode-execute cycle. To our know-
ledge, this would be the first instance of introducing instruction over-
lap and parallelism solely by microprogramming (i. e. , without modifying
the hardware).
5. High-level language parallelism by means of FORK and JOIN statements
or their equivalent has been proposed in many places. Implementation
of any well-defined set of corresponding machine language instructions
would be straightforward using the microcode level FORK, DETACH,
and STOP instructions. It would also be possible to define a variety of
machine language constructs which would give the programmer control
of the microprocessors directly. With this control, however, would come
the responsibility for assuring that the programs are deadlock-free.
6. The problem of performing a computation under real-time constraints
becomes more difficult with this multiple microprocessor model. Devel-
opment of hardware or software modifications to guarantee allocation of
processors or to give priority to certain programs would be worthwhile.
APPENDIX A
The CISCO Simulation System
The CISCO simulation system (J. V. Levy, “A Simulation System for Computer
Design”, Computation Group Technical Memo 113, Stanford Linear Accelerator
Center, January, 1971) is composed of three software packages, each written in
PL/I.
The CISCO compiler (Jim Low, "The CISCO Compiler", Report for Computer
Science 239, Stanford University, April 1, 1971; and Jim Low, "CISCO Language
Reference Manual", Computation Group Technical Memo 114, Stanford Linear
Accelerator Center, December 1970) accepts a description of a machine to be
simulated, written in a PL/I-like high-level language. Its output is loader text
for the KID interpreter.
The CISCO assembler (J. V. Levy, “A One-pass Assembler for CISCO”,
Computation Group Technical Memo 135, Stanford Linear Accelerator Center,
July, 1972) takes "opdef"-type instruction format specifications for a simulated
machine and assembles code for that machine. The output is initialization data for
memory in the simulated machine.
The KID interpreter (J. V. Levy, "KID - Interpreter and Loader Reference
Manual”, Computation Group Technical Memo 115, Stanford Linear Accelerator
Center, November, 1970) loads the machine description and initialization data and
then runs the simulation by interpreting the low-level code produced by the compiler.
The low-level operation of the interpreter is essentially a stack machine which
executes a series of functional blocks as co-routines.
APPENDIX B
Detailed Flowcharts, Data Values, and Assembler Listings
FIG. B-l. Z-machine definitions for assembler.
-iia-
Lh *E; c.N CL II, 13, (8, (81, * f UC * *
t * I
l R”S =TIK
0000 “00000002 *
*MEMO =TIK,
0000 “00000002 iT,KZ
0000 “0”000”08
HASTER SLAVE “EMORY NC, MASTER NOT SLAVE NOT HEHOR”
TIClE CCFlTHCL SETTIffiS
PROC ME co REG TIKI UC 2 MEMTiHE, RE‘ TI62 0‘ 8 MEMTlH.52
87. 88. 89. 90. 91. 92. 93. wt. 95. 96. 97. 98. 99.
LOO. 101. 102. 103. 104. 105. *ix. LOT. 108. 109. 110. 111.
FIG. B-l. (continued).
-119-
l * * * I *
*
*!4EMD -*EMDRY
DDOD “DDDDDDZD ODD1 “DDDDDDlC ODD2 “ODDDDOlb DO03 “OOODDO,4
5004 “ODDDDD1D 0005 “DDDOODDC 0o”b “DDODDDDB
DOD7 “oDDoDDD* l b!EMl =“EM”RY
DDDC “DOODDD,F DOD1 “ODDDDDLB 0002 "DDDDDDLT ODD3 "DDDDDDL3 ODD4 "DDDDODOF 0005 "DDOOODDR DDDb "OODDODD, ODD7 YDDODDDD3
*HEM2 =t4EMDR"
OODC "DODOODLE DDDL YDDDOODLA oooi! "DOODDDLb ODD3 "000000L2 DOD4 "DDDDDDDE 0005 "DOODDDO~ ODD6 "DDDDDOOb DOD7 "DDDDDDDZ
*HEM3 =ME”ORY
ODDC “DDDDDDlO DDO, “DDDDDD19 DOD2 YDDDDOD15 0003 “DDD”DDLl DOD4 “DODOODDD 0005 “DDDDODO9 DO06 “ODDDDDDS DO07 “ODODDDO,
l *
*MICROHEPI =STDRE
* l
*
*
R
PROC MEnO REG NEnORY DC 32 DC 28 DC 24 DC 20 DC I6 DC 12 DC 6 DC Q PRDC MEtI1 REG MEnDRY DC 31 DC 27 DC 23 DC 19 DC 15 DC 11 DC 7 DC 3 PROC nEn2 REG "EMORY DC 30 DC 26 DC *2 “C LB DC 14 DC 10 DC 6 DC 2 PRDC f4EY3 REG MEWXY OC 29 DC 25 DC 21 DC 17 DC 13 DC 9 DC 5 DC 1
PROC REG
196. 197. 198. 199. 200. 201. 202. 203. 204. 205. 206. 207. .?Ob. 209. 210. 211. 212. 213. 2,4. 215. 2L6. 217. 2LB. 219. 220. 221. 222. 223. 224. 225. 226. 227. t28. 229. 230. 231. 232. 233. 234. 235. 236. 231. 238. 239. 240. 241. 242. 243. 214. 245. 246. 247. 248. 249. 250. 25,. ncav4
FIG. B-2. Example 1 - assembler listing.
-120-
252 253 255 255 256 25, 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 27, 278 2,9 280 281 282 283 284 285 286 28, 288 289 290 291 292 293 2% 295 296 25, 298 299 300 301 302 303 304 305 306 307
OODC DO01 “002 0003
0004 0005 ODD6 0007 0000
0009 OOOd 0006 oooc 0000 DOOE OOOF OOlC 0011 ODLZ OOL3 0014 0015 0016 0017 OOl8
“Doooooco “00000078 “0000000F “0000007c
“oooooocl “00000079 “ooooooco “oooooo*F “00000058
“0000000, “00000079 “0000000, “00000078 “0000000 L “0000007c “00000048 “00000070 “ooooooco voooooo7A “DODODO49 “000000,0 “00000020 “0000004D “00D0000‘ “0000000,
l * l
l l
SL”
“Il. 2 S-1 STORING IN “EIIORY, -0 TO SLAVE “IL 3 INITIAL ADDRESS, ALSO INSERT AODRESS “AL 4 FIN&L IDDRESS “AL 5 RUNNING ADDRESS CDUNTER “AL 6 DPERlNO 1 V&L 7 OPERlND 2 IALVAYS CAUSE M-01
FORYARD REFERENCES
“AL DOF “AL OZE V&L 076 “AL OLD VAL ObC VAL OBF “*L 042 “AL 037 V&L OZA “AL 044 “AL OK “At. 096 VAL 009
SLbVE STARTS HERE R SET R TO VALVE OF 5 FROM MASTER
I NY ,NSERT ADDRESS
F FINII. LODRESS I G lN,T,lL,ZE THE RUNNING ADDRESS 0 5 5 SE, TO 0, ND STORES DONE YE, R
EP G R-.-O. FETCH FRDH “EMORY _..
MEW YIN WAIT FOR INPUT FROM MASTER OR HEMDRY
252. 253. 254. 255. 256. 257. 258. 259. 260. 261. 262. 263. 264. 265. 26.5. 26,. 268. 269. 2,o. 27,. 272. 273. 274. 2,s. 216. 27,. 278. 279. 280. 28,. 282. 283. 284. 285. 286. 28,. 288. 289. 290. 291. 292. 293. 294. 295. 296. 29,. 298. 299. 300. 30,. 302. 303. 304. 305. 306. 307. 106127
FIG. B-2 (continued).
-121-
308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 33-l 338 339 340 341 342 343 344 345 346 347 340 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363
0019 V0000007E STR OOlA VOOoOOocl LA1 0010 VOGOOOOAE AI OOIC voooooo58 R
0010 V0000004A OOlE v00000010 OOlF VOOOOOO36 0020 vooooooc 1 002 1 VOOOOOOAA 0022 voooooo58
0023 0024 0025 0026 0027 002a 0029 OOZA OOZH ooze 002rJ 502E OOZF 0030 0031 0032 0033 0034
voooooo2c v00000001 VOOOOOO4F VOOOOOOOA VOOOOOOFF VOOllOOO7A voooooo37 V00000040 VOOOOOOOE VOOOOOO4F voooooooB voooooo55 voooooo49 v00000010 VOOOOOOZE vooooooc2 VOOOOOOAZ voooooo58
SKI WIN LR OSL LI STR SK4 LR
#EMS LR DHEM INCR LH TEST SK3 LA1 AI a
0035 vooooooco 0036 vooooooo9 0037 v00000001 0038 VOOOOOOlc 0039 v00000038 0033 V0000007A 0038 v000000c1 003c vooooooB7 0030 voooooo58
+
LB
*
LI OMA WIN TIN SK4 STR LA1 AI B
003E VOOOOOC7F STR 003F vooooooc2 LA1 004c VCOOOOOAC AI 0041 voooooo58 B
0042 v00000040 0043 voooooooc 0044 v00000001 0045 v0000001c CO46 V0000003B 0047 V0000007A 004.9 vooooooc2 0049 VOOOOOOA4
EXAMPLE 1 51 FT ING SORT (14 APR 72) CISCO ASSEMBLER
+ + L4
c
L9
L2
* l
L7
LlO
LR TEST SK3 LA1 AI B
LR HEHH WIN 1 IN SK4 STH LAI AI
d L2115 L2
S
LE L9//5 L9
EP
B
-1 S LE G
B
G R
EP L7//5 L7
0
-SL 5 LB//5 LB
B L11//5 Lll
G
-.SL 5 L LO//S LIO
FIRST OPERAND
GO STAF.T UP
S>O, GO STORE R IN MEMORY
308. 309. 310. 311. 312. 313. 314. 315. 316. 317. 318. 319. 320.
IF S<O, WAIT FOR INPUT FROM SLAVE s=o : SEND B TO SLAVE
321. 322. 323. 324. 325.
SET 5 TO -1, INDICATES WAlTlNG FOR SLAVE IN 326. IALWAYS SKIPS) 327. S-0; STORE B AT G 32R.
HOVE UP THE ADDRESS CflUNTER
R-.-O, GO FETCH FROM MEMORY
R=Ot SEND SIGNAL TO MASTER
WAIT FOR INPUT IFROH MASTER1
FROM MASTER? NO: FROM SLAVE: SET S WAIT FOR MASTER
NEXT DATUM GO TO TFST AND COMPARE
FETCH FROM MEMORY
FROM MEMORY?
NO, FRflM SLAVE: SET S WAIT FOR MEMORY DATUM
329. 330. 331. 332. 333. 334. 335. 336. 337. 338. 339. 340. 341. 342. 343. 344. 345. 346. 347. 348. 349. 350. 351. 352. 353. 354. 355. 356. 357. 358. 359. 360. 361. 362. 363.
FIG. B-2 (continued).
‘- 122-
EXAMPLE 1 51 FTING SORT I14 APR 721 CISCO ASSEMBLER
364 365 366 361 368 369 370 371 372 373 374 315 376 317 318 379 380 381 382 383 384 385 386 301 380 309 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419
004b voooooo58
0048 VOOOOOO7F 004c VOOOOOO4F 0040 VOOOOOObb OOQE v00000010 004F V0000003A 0050 V0000004E 005 1 VOOOOOO5F 0052 V0000007E 0053 voooooo4c 0054 VOOOOOOb5 0055 voooooo 10 0056 V00000026 0057 V000000c3 0058 VOOOOOOBb 0059 voooooo58
STR B LR B SR A TEST SK3 GE LR A EXR B STR A LR F SR G TEST SK3 CT LA1 L3/f5 AI L3 B
* OOSA VOOOOOOlA TSL 0058 VOOOOOOZE SK3 EP 005c VOOOOOOcO LAI L4ff5 0050 VOOOOOOBO Al L4 005E voooooo58 B
* 005F VOOOOOOcO LA1 SLVff5 OObC VOOOOOOA9 AI SLV 0061 vooooooo4 FORK 0062 VOOOOOOlA TSL 0063 VOOOOOO2E SK3 EP 0064 vooooooc3 LAI L5ffS 0065 VOOOOOOAC AI L5 0066 voooooo58 8
l
0067 v000000c1 LI 1 0068 VOOOOOO7A STR S 0069 vooooooco LA1 L4ff5 OObA V00000000 AI L4 0060 voooooo58 B
OObC 0060 OObE OObF 0070 0071 0072 0073 0074 co75
0076 VOOOOOOlA oc77 V0000002E 0078 vooooooc4 0079 VOOOOOOAF
V0000004A VOOOOOOOA V00000048 v00000011 VOOOOOOOA V0000004c VOOOOOOOb vooooooco V000000i30 VOOOOOOSB
l
Lll
* * L5
* t L3
B
LR S OSL LR I INCA OSL LR F OSL LA1 L4ff5 Al L4 B
TSL SK3 EO LA1 Lb/f5 AI Lb
MEMORY DATUM
364. 365. 366. 36-f.
B-A:0 368. 369. 370.
EXCHANGE IF A>B 371. 372. 373. 374.
F-G:0 315. 376. 377. 370. 379.
G=F; GO TO FINISH 380. 381.
GuF: IS A SLAVE ALLOCATED? 382. 383.
YES, GO OIRECTLV TO OUTPUT 384. 385. 306. 307.
NO, TRY TO GET A SLAVE 388. 389. 390.
OiO I GET ONE? 391. 392.
YES, GO SEND lNllIALIZATION TO SLAVE 393. 394. 395. 396.
NO, SET S TO 1, INDICATES STORING HAS BEGUN 397. 398.
GO 00 OUTPUT
INITIALIZATION FOR SLAVE S BECOMES SLAVE’S R
I+1 BECOMES SLAVE’S I
F 8ECOMES SLAVE’S F
GO 00 OUTPUT
FINISH: SEE IF I AM THE BOTTOM PROCESSOR
NO: GO FINISH
399. 400. 401. 402. 403. 404. 405. 406. 407. 408. 409. 410. 411. 412. 413. 414. 415. 416. 417. 410. 419. 2lodAlv
FIG. B-2 (continued).
-123-
EXAMPLE 1 SIFTING SORT I14 APR 721 CISCO ASSEMBLER
420 421 422 423 424 425 426 427 420 429 430 431 432 433 434 435 436 437 43B 439 440 441 442 443 444 445 446 447 440 449 450 451 452 453 454 455 45h 457 450 459 460 461 462 463 464 465 466 467 468 469 470 411 472 473
CO7A voooooo5B
0070 007c 0070 007E 007F 0080 0081 0082
voooooo4rJ LR G VOOOOOOOE MEMS VOOOOOOCF LR 0 voooooooB OMEN VOOOOOO48 LR I VOOOOOOOE HEMS V0000004E LR A voooooooB OMEH voooooo53 INCR I voooooo4B LR I VOOOOOOb4 SR F v00000010 TEST VOOOOO029 SK2 LT vooooooo6 OET voooooooo STOP
* B 420.
0083 OCB4 0085 0086 0087 OCR8 OOB9
* 008A VOOOOOOc 1 LI 1 OORB voooooc79 STR R 008C vooooooco LA1 Llff5 0080 VOOOOOOAF AI Cl OOBE VOOOOOO5B B
421. 422. YES: PUT AWAY B 423. 424. 425. 4?6. 427. 420. 429. 430. 431. 432. 433. 434. 435. 436. 437. 438. 439. 440. 441. 442. 443. 444. 445. 446. 447. 44R. 449. 450. 451. 452. 453. 454. 455. 456. 457.’ 458. 459. 460. 461. 462. 463. 464. 465. 466. 46-I. 468. 469. 470. 471. 472.
STORE RESULT AT I
SEE IF BORE IS TO BE DONE
I :F
NC MORE, PUIT
MORE; SET R TO CAUSE FETCH
GO START AGAIN
* * Lb OOBF VOOOOOO4A
0090 voooooo 10 0091 V00000036 0092 vooooooc4 0093 VOOOOOOBA 0094 voooooo5B
0095 voooooo2c 0096 v00000001 0097 VOOOOOO4F CO98 VOOOOOOOA 0099 voooooo37 009A v00000040 009B VOOOOOOOE 009c VOOOOOO4F 0090 v00000006 009E VOOOOOO4B 009F VOOOOOOOE OOAO V0000004E OOAL voooooooB OOA2 VOOOOOO06 OOA3 voooooooo
LR S FINISH
S>O, GO STORE IN MEMORY
TEST SK3 LE LA1 L12ff5 AI L12 B
* SK1 EP WIN ;‘;, WAIT FOR INPUT FROM SLAVE
= : SEND B TO SLAVE
I ALWAYS SKIPS) S-0; STORE B AT G
LR B OSL SK4 LE LR G L12 HEMS LR B OMEH LR I STORE A AT I HEMS LR A OMEM DET STOP THAT’S ALL
* * *
tRUN TEXT XRUN XSNAP TEXT ISNAP %STOP TEXT tSTOP
FIG. B-2 (continued).
-124-
196 197 19B 199 205 LO1 202 2b3 204 235 ZUb 2u7 2U8 2C9 210 211 212 213 214 215 216 217 210 2LY 22Ll 221 222 0014 223 0015 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 24C 241 242 243 244 245 246 247 246 243 250 251
0030 0001 0002 0003 0094 ocl35 0006 cao7 0000 0009 OljOA 0000 OOOC OGJO OOOE OGOF 0010 3011 0012 0013
OOLb 0017 0018
0040
0080
3030 0001 3932 OJJ3 3034 0005 0096 OOb7 ooofl
* * * *
*HEnO =REWORY VOOOOOOb3 VOOOOOO5F voooooo5B voooooo57 voooooo53 VOOOOOO4F VO300004B v0000c347 v30000043 VOOOOOO3F voooooo38 voooooo37 voooooo33 VOOOOOO2F VO’lOOOO2B VOODOO027 VO3OCOO23 VOOOOOOLF VOuOOOOlB vo3ocoo17 v00000013 VOOOOOOOF VO3OOOVOB voooJooc7 vooooooo3 =MEHORV 5 64 VFFFFFFFF =HEMORV S 126 v00000130 *MEML =FIERORY VOOOOOOb2 V0000005E VOOOOOC5A VOOOOOO56 VOGOOOO52 V0000004E V0300004A VO0050346 VO~~OOOOCZ VOOOJ003E V0000003A vooococ36 VO0000032
0009 OOOA OOJa oooc oouv VOOOOOO2E OOOE VOOOOCO2A OOGF VOOOOOO26 0010 vooooo:)22
EXAMPLE 2: SHELL SORT (12 APR 721 C ISCO ASSEMBLER
INITIALIZATLON OF MAIh MEMORY
196. 197. 198.
PROC MEMO REG MEMORY DC 99 DC 95 DC 91 DC 8-t DC 83 DC 79 DC 75 DC 71 DC 67 oc 63 DC 59 DC 55 DC 51 DC 47 DC 43 DC 39 DC 35 DC 31 DC 27 DC 23 DC 19 DC 15 DC 11 DC 7 DC 3 ORG OlOOff2
199. 200. 201. 202. 203. 204. 205. 2ct. 2C7. 208. 209. 21c. 211. 212. 213. 214. 215. 216. 217. 218. 219. 220. 221. 222. 223. 224. 225. 226. 227.
DC -1 DESCRIPTOR: ORIGIN OF TABLE 226. ORG C2OOff 2 229.
DC 0100 FIRST STACK ENTRY: POINTER TO DESCRIPTOR 2tO. PROC HEM1 231. REG MEMORY 232. UC 98 233. DC 94 234. DC 90 235. DC 86 236. UC a2 237. UC 70 238. UC 74 239. DC 70 240. DC 66 241. UC 62 242. DC 58 243. DC 54 244. DC 50 245. DC 46 246. DC 42 241. DC 38 248. DC 34 249.
FIG. B-3. Example 2 - assembler listing.
-125-
EXAWLE 2: SHELL SORT 112 APR 721 C ISCO ASSEMBLER
252 253 254 255 256 257 25a 259 2bJ 261 262 21~3 204 ib5 266 261 268 269 27U 271 272 273 274 275 276 277 270 279 289 281 282 263 La4 2a5 286 201 288 2a9 290 291 292 293 294 295 296 297 298 299 300 3cJl 3G2 303 3ti4 305 3L;b 307
0011 OOli! 0013 0014 0015 0016 0017 0018
VOOOOOOlE VOODOOO1A VOOOOOOl6 v00000012 V0000003E VOOOOOOOA vooooooo6 vooooooo2 =HEMORY 5 64 VOOOOOO64 *HEM2 =HEHORY V00000Ubl V0300005D v03030059 voooooos5 voooooo51 V0000004D v03000049 v03000045 voooooo41 V30000030 voooooo39 voooooo35 v130000031 v00000020 VOOOOOO29 VO7000025 v00000021 voooooolD vooooocl19 v00000015 v00000011 VOOOOOOOD v00000009 vooooooo5 v00000001 *MEH3 =HEHORY VOi)OOOObO v11)00005c voooooo58 v00000054 v00000050 v0700004c voooooo48 v00000044 v00000040
DC DC DC
30 26 22
250. 251. 252. 253. 254. 255. 256. 257. 258.
DC DC
18 14
DC 10 b 2
ClOO//Z
DC DC ORG
100 HEM2 MEMORY
0040 DC PROC REG DC DC DC DC DC DC
DESCRIPTOR: LENGTH Oii TABLE 259. 260. 261. 262. 0000
5031 0002 0003 OUO4 0005 0006 0007 0008 ooc 9 OUOA 0030
263. 264. 265. 266. 267. 26.3. 2t9. 27C. 271. 272.
DC DC DC DC DC DC
73 69 65 61 57 53 49 45 41 37 33 29 25 21
273. 274. 275. 276. 277. 210. 279. 280. 281. 282. 283. 284.
OGOC OrJO OGOE OOOF 001a 0011 0012 0013 DU14 0015 0016 OG17 OOltl
DC DC DC DC DC DC DC DC
E 17 13
9 5
DC DC 205.
2.96. 207. 288. 289. 290. 291. 292. 293. 294. 295. 296. 297. 298. 299. 300. 331. 302. 303. 304.
OC 1 PROC MEH3 REG MEMORY DC 96 DC 92 DC 88 DC 04 DC 80 DC 76 DC 72 DC 68 DC 64 DC 60 DC 56
0000 000 1 oocl2 0003 0004 0005 OOJ6 0007 ootitl 0009 voooooo3c OGJA voooooo38 oool3 voooooo34 oooc voooooo 30 OOOD vo 100002c OODE v00000020 003F VOOOOOO24
DC 52 DC 40 DC 44 DC 40 DC 36
FIG. B-3 (continued).
-126-
308 3u9 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 947 348 349 350 351 322 353 354 355 356 357 358 J5? jbJ 301 362 3b3
OGlO 0011 0012 0013 0014 0015 OGlb 0017 OOlfl
EXAMPLE 2: SHELL SORT (12 APR 721 C tSC0 ASSEMBLER
v00000020 v0000001c vooooool8 voooooo14 v00000010 voooooooc vooooocoe vooooooo4 voooooooo
DC DC DC oc
2 DC DC DC
* l
32 26 24 20 lb I2
a 4 0
* * LAYOUi OF STACK AND DEXRIPTOR: * + STK -----> D -----> BASE -----> . . . * P N TABLEIlJ * R TABLEl2J * .? . * TN . * TABLEIN) * * * REGISTER ASSIGNMENTS 1 1 VAL 1 INCREMENT VALUE STK VAL 2 SIACK POINTCR El VA1 3 =BASE+R =JlTABLEIRJ J K VAL 4 OUTER LOOP COUNTER L VAL 5 INNER LOOP COUNTER X VAL 6 OPERAND 1 Y VAL 7 OPERAND 2 * * * OFFSETS FROM STACK PCINTER * P VAL 1 ‘PASS NUMBER*; CONTAINS If2Jr FIRST INCR R VAL 2 PROCESSOR NllMBER, =OFFSET IN TABLE 2 VAL 3 SAVED ITEM FROM TABLFtRJ TN VAL 4 7lTABLElNII STKDEP VAL 5 DEPTH :lF STACK PER PROCESSOR l
* * * * FORWARD KEFERENCES * COUNT VAL OF SLV VAL c35 Ll VAL 037 L2 VAL 052 L3 VAL C84 St3 vAL OF0 51 VAL 03E 53 VAL 082 55 VAL cc4 57 VAL CEO
3c5. 306. 30). 3C8. 3c9. 310. 311. 312. 313. 314. 315. 316. 317. 318. 319. 320. 321. 322. 323. 324. 325. 326. 327. 328. 329. 330. 331. 332. 333. 334. 335. 336. 337. 338. 339. 340. 341. 342. 343. 344. 345. 346. 347. 348. 349. 350. 351. 352. 353. 354. 355. 356. 357. 358. 359. 360.
FIG. B-3 (continued).
-127-
EXAMPLE 2: SHELL SORT 112 APR 721 CISCD ASSEHBLER
364 3b5 Jtdo 307 368 369 J7U J71 372 373 OUdCl 374 OOUl 375 GlJO2 37b OGJ3 377 olJci4 37d 0005 379 OGC6 3Bti 0037 3dl Oi)Sd 3t.2 0009 38J UUOA 334 oous 385 ococ 3a6 0030 387 OOCE ;1bB OUOF 3e9 0013 390 0011 391 0212 392 0~113 393 CQ14 394 0015 395 OOlb 396 0017 397 3Y8 JcJlB 399 OGl9 400 OOlA 4Dl OGlt) 4iri ooic 4i)3 0010 4G4 OOli 405 OClF 4Lb oil20 4ti7 0021 4U8 0022 4c9 OG23
* * *
*MICROMEt =STORE VOOilUOODO VOOOOOOAO VOSOOO37A VO’)OOGC4A V00000000 v00000001 v000c0011 vo’lm-!ouoo vooooooco v30000019 vo3oooco1 v00000ir7c v00000390 vooooou9o VOOOOC07E V000~054E COUNT v03000c90 VOOOOCD7E v0300c010 VCQOr)OOZF voooo@c5l VOGOG03CG VO’)OI;OOAF vo’)oooJ5B
* VOJ0003Cl voooDDoa3 VD3OCO 579 vooocoocl VOOOOOO42 V3000000E v’)ooCo349 v3oooonoB vooooooc2 VOOOOOC42 VODOOCOOE VO J@OOflCY VOOOOOOOB VOOOOOO4A VO700003D v000000cl1 VOOOOO07D VOOOOOoOD v00000cl01 V0000007tl vooooooc4 VO3000042
4i3 0024 411 0025 412 0026 413 GO27 414 OcJZiJ 415 0029 416 OOZA 417 OOZD 418 oczc 419 03LLl
56 VAL CES 54 VAL Cl02 ENDTE ST VAL OllC CYCLE VAL 013A
PRUC HlCPCHEH tlEG STORE LA1 0200//S AI 0200 STR STK LK STK HtHF WIN INCA MEMF LI 0 STR I WIN STR K SHR LN,lll SHR LN.111 STR X LR X SHR LN. Ill STR X TEST SK4 EQ INCR I LA1 COUNT//S AI COUNT B
LI 1 I I=SHIFT COUNTI SHL LN.(Rl) SHlFr LEFT BY COUNT STH I I l2)=2**lLCG h)-2 Ll P AR STK YEMS LR I ofltbl LI R AU STK MEMS LP. 1 OWEM LR STK HEMF WIN STR L MEMF WIN STR 13 LI TN AR STK
INITIALIZE STACK POINTER CCHPUTE 1121
FETCH N INITIALIZE COUNT TO 0
SAVE N IN K N/2 N/4
HOLD IN X
SHIFT UNTIL = 0
CCUNT ONE WHEN -=O GO TO SHIFT LCOP
STORE I I21 AT P
STORE RIMAXI AT R
RfMAXI=IlZI CCHPUTE alTABLE(Nll
GET BASE
361. 362. 363. 364. 3.55. 366. 367. 368. 369. 370. 371. 372. 373. 374. 375. 316. 377. 378. 379. 3RC. 381. 302. 3e3. 384. 385. 386. 397. 3ea. 389. 390. 391. 392. 393. 394. 355. 356. 397. 358. 399. 4co. 401. 402. 403. 404. 4c5. 406. 4c7. 4ca. 409. 41c. 411. 412. 413. 414. 415. 416.
FIG. B-3 (continued).
-128-
EXAMPLE 2: SHELL SORT 112 APR 721 CISCO ASSEMBLER
423 421 422 423 424 425 426 421 428 429 430 431 432
OOZE OOZF 0030 033 1 0332 0033 0034
VOOOOC30E voooooo4c voooooo43 v00000008 v000000c1 voOooooB7 voooOoo58
l l
vo!looooo1 SLV VOOOOOOTA v000000c1 Ll V30000042 voooooooo vooooooo1 voooooo79 VOOOOOOc2 voooooo42 V00000000 v00000001
MFMS LR K AR 0 OMEH LAI L1//5 AI Ll R
411. 41e. 419.
STORE AT TN 420. GO TO NEY PASS 421.
422. 423. 424. 425. 426. 427.
SLAVE STARTS HERE GET NEW STACK POINTER FROM MASTER INITIALIZE I TO r(2) FOR NEW PROCESSOR 428.
429. 43c. 431. 432.
CCHPUTE ZIlTABLEiRll 433. 434. 435. 436. 437. 438. 439. 440. 441. 442. 443. 444. 445. 446. 447. 448. 449. 450. 451. 452. 453. 454. 455. 456. 457. 451. 459. 460.
E=BASE+R=a(TABLEIRI 1 PUT MARKER IN TAELEIRI
LARGEST NEC NUMBER PLACE IT AS A MARKER
TABLE(R) -> Y STORE AT 2 IN STACK
GET R
0035 3036 503 7 0038 0039 003A 0038
WIN STR STK LI P AR STK HEHF WtN STR I LI R AR STK MEMF WIN STR B LR STK flEHF WIN MEHF WIN AR B STR 8 HEMH
433 434 435 436 003c 437 0030 438 003E 439 003F 440 ODCS v0I)000070 441 ou41 VOOOOOO4A 442 0042 443 0043 444 0044 445 0945 446 Ji46 447 uo47 448 0048 449 0049 450 J04A 451 0048
voooooooo v00000001 VOOOOOOOD vocJooooo1 v00050043 voooooo75 VO9OOOOOF vcl30000c1 v33000098 vooouoooB VOrJOOOOC3 VO7000042 V0000000E VOJOGOOOl VOOOOOO7F voooooo’Jtr
* v000000c2 L2 VOOOC’OO42 v03000000 v33000031 v0lJ000012 V00000070 v00000010 VO!lOOOO26 vooooooc4 VOOOOOOA4 voooooo58
* vouooooc1 vooO@onB5 v000110004 V03000OlA VOOOO003E
R:l
R=lt NO MORF PROCESSORS REP’0
LI 1 SHR CN.11) OMEM
452 004c 453 OOCD
LI 2 AR STK MEMS WIN STR Y OMEM
454 455
OO4E Ob4F 0050 0051
OG52 0053 0054 G355 0056 0057 0056 0059 0051 005t) 005c
005D DOSE OOSF 0060 0061
455 457 45li 439 LI R
AR STK MEt4F
460 461 462 4b3
WIN DECA STR L TEST SK3 GT LA1 L3f 15 AI L3 8
LA1 SLV//S AI SLV FORK TSL SK3 hE
464 465 4bb 4bl 4b8 469 470
461. 462. 463. 464. 465. 466. 46-I.
471 4 72 473
R>l; TRY TO GET A SLAVE
GET ONE?
460. 469. 470. 471. 472.
,IC&lY
474 475
FIG. B-3 (continued).
;129-
EXAMPLE 2: SMELL SORT (12 APR 72) C ISCO ASSEMBLER
476 471 478 479 400 481 482 483 484 485 486 467 480 489 490 491 492 493 494 495 496 4Y7 490 499 500 501 502 503 504 505 5li6 501 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 52b 527 528 S29 530 531
0062 0063 0064
0065 0066 006 7 0068 0069 006A 0068 OObC OObD OObE 006F 0070 0071 0072 0073 0074 out5 0076 0077 0078 0019 007A 0078 007c 0070 007E 007F 0080 0081 0082 0083
OC84 OJ85 0086 0087 OOBE 0089 OOBA OOBB 008C ooao 008E 008F 0090
0091 0092 0093 0094 0095 0036
vooooooc4 VOOOOOOA4 voooooo58
l
VO000004A voooooooo vooooooc5 VOOOOO042 VOOOOOOOE v00000001 voooooooB v000000c1 VOJOOOO42 VOOOOOOOD VOOOOOOC6 VOOOOO042 VOOOOOOOE v00000001 v00000000 vooooooc7 VOOOOO042 VOOOOOOOE V0’)000040 voooooooB V000000c4 VOOOOO042 voooooooo V000000c9 VOOOOO042 VOOOOOOOE v00000001 v00000000 vooooooc5 VOOOOO042 VOOOOOOOA
l
v00000040 L3 voooooo41 voooooo7c v000000c1 VOOOOO042 V00000000 v00000001 VOOOOO061 v00000010 V0000003E vooooooc4 VOOOO&&E voooooo5.3
* Vo000004c s2 voooooooo- v00000001 V0000007E VOo0000cl voooooo98
LA1 L3//5 Al L3 B
NO; FORGET IT FOR NOW
LR STK YES; GENERATE PARAMETERS NEMF LI STKOEP AR STK iEMS WIN OMEf4 LI P AR STK HEMF Ll P+STKDEP AR STK HEMS WIN OMEM LX R+STKOEP
EMS STK
LR L OMEM LI TN AR STK ME MF LI STKOEP+Th AR STK HEMS WIN OMEM LI STKOEP AR STK OSL
LR 0 AR I STR i Ll P AR STK HEHF WIN SR I TEST SK3 NE LA1 51//s Al Sl 0
LR K HEHF WIN STR X LI 1 SHR CN.11)
SLAVE’S STACK POINTER I=a(O)l
COPY 0
473. 474. 475. 416. 477. 470. 479. 480. 481. 402. 403.
cam P
484. 485. 486. 407. 488. 489. 490. 491.
STORE ‘R-l’ AT SLAVE’S R 492. 493.
(R-1, FROM ABOVE) 494. 495. 496.
COPY TN FOR SLAVE
SEND NEY STACK POINTER TO SLAVE
INITIALIZE K=il~TABLElR+I~J GET I(21
497. 498. 499. 500. 501. 502. 503. 504. 505. 506. 507. 508. 509. 510. 511. 512. 513. 514. 515.
1121:1 516. 517. 518.
t=IlZIi NO DATA DEPENDENCY. GO SORT 519. 520. 521. 522.
CHECK IF SORTIP-l.R*I) IS DONE 523. LOOK AT TABLE(R+Il 524.
525. 526.
GENERATE MARKER 521. 52B.
FIG. B-3 (continued).
-130-
EXAMPLE 2: SHELL SORT (12 APR 721 C ISCO ASSEMBLER
532 533 554 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558
0097 0098 0099 OiJ9A 009a 009c
0090 OOYE OOYF OOAO OOAl
voooooo66 v00000010 VOD00003E VOOOOOoc4 VOOOOOORl voooooo58
* VOOOOOO3F voooooo4c 51 voooooooo v00000001 VOOOOOOlE
*
SR X TEST SK3 NE LA1 52//S Al s2 e
SK4 NE LR K MEHF WIN STR X
2 IS TEMPORARILY STORED LR X SR Y TEST SK3 LT LA1 53//s Al 53 B
COMPARE TABLE(R+ll YITH MARKER
LOOP UNTIL MARKER IS GONE
525. 530. 531. 532. 533. 534. 535. 536. 537. 538. 539. 540. 541. 542. 543. 544. 545. 546. 547. 540. 549. 550. 551. 552. 553. 554. 555. 556. 557. 558.
(ALWAYS SKIPS) GET TABLE(R+I)
IN Y AS WELL AS IN STACK
X:2
GO TO OUTER LOOP
OOA2 V0000004E 0013 VOOOOO067 OOA4 OOAS
voooooo 10 VOOOOOOZA voooooocs voooooo82 voooooo58
* vooooooc3 VOOOOO042 VOOOOOOOE V0000004E VOOOOOOOtl voooooo4c VOOOOOOOE VOOOOOO4F VOOOOOOOR voooooo49 53 vrJooooo44 V0300007c
OOA6 OOA7 OOAM
OOAY OGAA COAL3 OOAC OOAO OOAE OOAF OOBU
LI 2 AR STK MEMS LR X OHEM LR K
INTERCHANGE x AND 2 STORE X AT 2 IN STACK
2(TABLEIR*II)
STORE 2 AT TABLEtR+Il
K<-K* I
559 560
MEMS LR Y
b61 OOBL 5~2 OGB2
OHEH LR I 559.
560. 563 564 565 566 567 560 5b9 570 571 572 513 514 515 576 577 578 519 580 501 582 %33 584 565 586 507
0083 ooil4
AR K STR Ll fN AR STK MEMF WIN SR K TEST SK3 GE LA1 54//s AI 54 B
LR K MEMF STR L WIN STR X LR L SR I STR L SR B
561. 562. 563. 5t4. 565. 566. 561. 560. 569. 570. 571. 572. 513. 574. 575. 516. 577. 578. 579. 580. 581. 582. 583. ia4. 1lCWll
OOd5 vooooooc4 0086 v00000047 0087 0008
voooooooD v00000001
0089 VOOOOO064 OOaA v~30000010 ii(TABLEtN)):K
‘OONC WITH INSERTION SORl
lTASLEtR+V+Il I
OOdB OOBC OOBO OOBE
OOBF ooco ooc 1 OOC2 ooc3 ooc4 ooc5 OOC6 OOC? ooce OOCY OOCA
VfJ000003A voooooocB VOOOOOOAZ voooooo58
* voooooo4c voooooooo v00000070 vooooooo 1 V0000007E V00000040 55 VOOOOO061 V00000070 VOOOOO063 VOOOOO061 v00000010 V0000003A
SR 1 TEST SK3 GE
L:R+I
FIG. B-3 (continued).
7131,
588 OOCB 589 uocc 590 OOCD 591 592 OOCE 593 OOCF 594 0000 595 DOD1 596 0002 597 0003 59tl 0004 599 0005 600 0006 601 0007 602 OGJ8 603 0009 604 OCOA MS 0006 606 oooc 607 0000 608 O(iOE 609 OOSF 610 611 ODE0 612 ODE1 613 OOE2 614 00E3 615 OOEI 616 617 616 ODE5 619 OOEb 620 OOE? 621 GOEB 622 OOEY 623 OOEA b24 OOEB 625 OOEC 626 OOEO 621 OOEE 628 03EF 629 OOFO 630 OOFl 631 OOF2 632 b33 OOF3 b34 OOF4 635 OOFS 636 OOFb 631 COF? 638 OOFLI 639 OOFP 640 OGFA 641 OOFB 642 OOFC 643
EXAMPLE 2: SHELL SORT I12 APR 721 CISCO ASSEMBLER
vooooooc7 VOOOOOOAS voooooo58
V00000040 V00000000 v00000001 VOOOOOO7F voooooo66 v00000010 V00000G40 voooooo41 VOOOOOOOE VOODOO026 vooooooc7 VOOOOOOAO voooooo58 VOOOOOO4F voooooooB VOOOOOOCb VOOOOOOA4 voooooo58
VO000004E VOOOOOOOB vooooooc5 vooooooB2 VOGOOO058
v’ooooooc3
v00000010
VOOOOO042 V00000000
voooooo48
v00000001 VOOOOOO7F voooooo66
v00000041 VOOOOOOOE VOOOOO026 v000000c1 vooooooao voooooo58
VOOOOOO4F voooooooB V000000c3 VOOO@OO42 VOOOGOOOE V3000004E VOOOOOOOB vooooorJc5 voooooOB2 voooooo58
*
* 57
* * 56
*
*
LA1 S6//5 AI 56 8
LR L ME MF WIN STR Y
::ST X
LR L AR 1 UEMS SK3 GT LA1 57//s Al 57 B LR Y OMEH LA1 55//s Al 55 El
LR X OHEM LA I 53//s AI 53 B
LI
TEST
2 AR
LR
STK MEMF
B
WIN STR Y SR X
AR I HEMS SK3 GT LA1 58//S Ai SE I3
LR Y OMEM Ll 2 AR STK MEMS LR X OMEM LA1 53//s Al 53 B
L<R+I: GO 00 LAST ONE
L=R
L>=R+I: 00 NEXT
y:x
X>=Y; INSERT HERE
X<Y; RIPPLE Y
GO TO INNER LOOP
INSERT X AT TABLElL*I I
GO TO OUTER LOOP
LAST COMPARISON: USE 2
2:x
STORE 2 AT TABLEIR+Il
SlURE X AT 2
GO TO OUTER LOOP
585. 586. 58f. 588. 589. 590. 591. 592. 593. 594. 595. 596. 597. 598. 599. 600. 601. 602. 603. 604. 605. 606. 601. 6C8. 609. 610. 611. 612. 613. 614. 615. 616. 617. 618. 619. 620. 621. 622. 623. 624. 625. 626. 627. 620. 629. 630. 631. 632. 633. 634. 635. 636. 637. 638. 639. 640.
FIG. B-3 (continued).
-132-
EXAMPLE 2: SHELL SORT (12 APR 721 CISCO ASSEHRLER
b44 * b45 OOFD V030C1004E SB 646 OljF t vc9ocoGoB b47 CCFF v5oor)noc5 648 OlCC vOOo~oo~2 649 Cl131 v300~0058 65J *
LR X OMCH LA1 s//5 AI 53 B
INSERT X AT TABLE(R+Il GO TO OUTER LOOP
651 1 b52 OlJ2 v00000047 54 b53 0123 v00010c90 054 OlU4 voJo~m7Y
LR I WR LN.11) [<-I/Z STR I
655 OIG’, v000000c2 LI R 656 31Ub VOOO60342 AR STK 653. 657 Cl07 VODO3Oil@D MEMF 654. 658 oiue vooosco~i WIN 653 0129 V’)JODGO61 SR I R:I 6f.6 OlUA VO~JOUOO13 TEST bhl OlOtl VOOOOO~36 SK3 LE 6b2 663 064 bt.5 665 6t.7 6bd 6b9 b7C 671 372 673 674 675 076 077 678 679 080 b&l1 682 iJll3 684 bs5 686 b87 bdB 689 693 691 b?.? b93 694 b95 I96 691 698 099
CLDC VcOOGoCca 0130 vo~ococIBc 310E VC00C005A
013F v030000c3 0113 VO3OCOO42 0111 V9000000D 0112 vo!I33coc’3l 0113 VOOOOOO7F 0114 VOCOGO31A 0115 V0000032E 0110 v9300c0c4 0117 v0000031\4 OLlcI voooooo5a
0119 voooPoDc2 OllA v00000232 JllL-5 v01)000ir58
SllC vooooooc3 OllD VO5000042 DllE vcooooooD OllF VO~OOO04B 3120 V0300030E 0121 v0c000001 0122 v03uiJ0000 5123 VOOOoOOc2 0124 VOOOOO042 0125 VOOOOOOOD 0126 voooooool 012 7 v00000012 OlZtl VDOOD007D 0129 vo3ocoo1o DlLA V000D003E 0126 V000000c9 012c VOOOOOOBA 012D vOooooo5B
LA1 Al B
* LI AU MEHF WIN STR TSL SK3 LA1 AI B
l
LA1 AI B
* * ENOTEST LX
AR HEMF LR HEMS WIN OHEM LI AR HEMF WIN DECA STR TEST SK3 LA1 AI B
ENDTEST// KC MORE PASSES FOR THIS R ENOTEST
2 STK
RElNlTIALIZC Y=Z FOR NEXT PASS
Y
EQ L3//5 L3
(MORE PASSES TO GO) SLAVE ATTPCHEO? YES; GO TC NEXT PASS
L2//5 NO; TRY AGAIN TO GET ONE L2
2 STK
6
STORE 2 AT TABLE(R)
R NC MORE PASSES FOR THIS R STK
L HOLD R-l R:l
NE CYCLE//S A=1 ; THAT’S ALL CYCLE
641. 642. 643. 644. 645. 646. 641. 640. 649. 650. 651. 652.
655. 656. 657. 658. 659. 660. 661. 662. 6t3. 664. 665. ht6. 667. 6t0. 669. 670. 611. 672. 673. 674. 675. 676. 617. 618. 679. 660. 681. 612. 683. 684. 685. 6Bb. 607. 680. 689. 690. 691. 692. 693. 694. 695. 656.
FIG. B-3 (continued).
-i33-
70u 7Cl 7”Z lu3 7L4 7d5 7Ub lL7 728 709 71L 711 712 713 714 715 716 717 710 719 I 2 0 721 722
012E 012F J13U 0131
Cl32 11133 ii134 tJ135 0136 UIJ7 013b 0139
013A
EXAHPLE 2: SHELL SORT II2 APR 721 CISCI-I ASSEFlRLEP
* VODO@OOlA v33000020 V03000006 v39000000
* VOOOOO:)c2 VO’JOOOO42 Vo!lOo0DnE V’l’IOC004D v3onoGou0 VO~JOC@L)CI v 3ooooc37 VOCCOGOSA
* *
VOOOOOO30 CYCLE t *
%RUN %SNAP ZSTOP
TSL SK2 EQ DET STOP
Ll R AR STK HEMS LR L OMEM LA1 I.1 //5 Al Ll B
STOP RETURN TO MAIN INTERPRETER
TEXT fRUN TEXT ZSNAP TEXT XSTOP FND
THIS R IS OWE. NEXT BEGUN?
YES; I MAY GO AWAY
NC: THIS PROCESSOR BEGINS AS R-l
(CONTAINS R-l FRCH ABOVE1 STORE R-l AT RI CC BECOME PPCCESSCR R-l
657. 6SO. 699. 700. 7c1. 702. 7c3. 104. 705. 7C6. 707. 7cLI. 709. 71C. 711. 712. 113. 714. 715. 716. 711. 710. 719.
FIG. B-3 (continued).
-1349
tXAMPLE 3: ~AWIX HULTIPLY ill CAY 72)
[Assembler listing: initialization of main memory. Memory modules MEM0 and MEM1 are loaded (ORG 0300 and 0400) with DC constants holding the test matrices of Fig. B-5; matrix C begins at 0500.]
FIG. B-4. Example 3 - assembler listing.
[Assembler listing: initialization of memory modules MEM2 and MEM3 with the remaining matrix data (ORG 0300 and 0400). Register assignments begin:
  STK    VAL 1   stack pointer
  CADR   VAL 2   destination address]
FIG. B-4 (continued).
[Assembler listing; the machine-code columns are not legible. The recoverable symbol assignments and comments are:

Register assignments (continued):
  LEN      length and width of matrices
  INDEX    counter for inner product loop
  I        row number, during startup; PROD, the product, during the inner product
  J        column number, during startup
  A        operand 1, during multiply
  AADR     address of operand 1
  B        operand 2, during multiply
  BADR     address of operand 2

Constants:
  STKBASE  VAL 2   offset of stack base pointer, in stack
  PUSH     VAL 3   length of stack push, per processor

Offsets from STKBASE:
  IBASE    VAL 1   I index value, for startup
  JBASE    VAL 2   J index
  ABASE    VAL 3   A vector address
  BBASE    VAL 4   B vector address
  BASELEN  VAL 5   length of stack base

Layout of stack: an initial block (FLAG, IBASE, JBASE, ABASE, BBASE) at STKBASE, followed by one block per forked processor holding AADR(k), BADR(k), and a pointer back to STKBASE. Initially AADR = ABASE = addr(A(1,1)), BADR = BBASE = addr(B(1,1)), CADR = CBASE = addr(C(1,1)), LEN = length, and STK holds the initial position as above. Processor P1's register 1 (STK) is initialized by DC 0600, the initial stack pointer.]
FIG. B-4 (continued).
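The stack discipline just reconstructed is compact enough to restate in a higher-level notation. The C sketch below is an illustrative rendering of that layout only, not code from the thesis; the one-word field type and the struct names are assumptions drawn from the listing's symbols.

```c
typedef int word;   /* illustrative: stands in for one Z-machine memory word */

/* Base block at STKBASE, shared by all processors working on the
   multiply (BASELEN = 5 words in the reconstruction above). */
struct stack_base {
    word flag;    /* FLAG:  set to 1 when more processors are needed */
    word ibase;   /* IBASE: I (row) index value, for startup         */
    word jbase;   /* JBASE: J (column) index value                   */
    word abase;   /* ABASE: A vector (row) address                   */
    word bbase;   /* BBASE: B vector (column) address                */
};

/* One block pushed per forked processor (PUSH = 3 words). */
struct stack_block {
    word aadr;     /* AADR(k): address of operand 1         */
    word badr;     /* BADR(k): address of operand 2         */
    word stkbase;  /* pointer back to the shared base block */
};
```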
[Assembler listing: startup code. Recoverable comments: initialize I and J to 1; put STKBASE in the first block and move the stack pointer up past it; store I at IBASE and 0 in FLAG; store J+1 at JBASE, AADR at ABASE, and BADR at BBASE; go get a slave; test J:LEN and I:LEN; when I = LEN, go do the last one; store I+1 at IBASE.]
FIG. B-4 (continued).
[Assembler listing: forking a slave for the next row. Recoverable comments: store 0 at FLAG; store 1 at JBASE; store AADR+LEN at ABASE and BADR-LEN+1 at BBASE; try to get a slave; if none, go set FLAG to 1 (more processors needed); if one is got, send the startup data - stack pointer, destination address, length - and go do the inner product.]
FIG. B-4 (continued).
[Assembler listing: the software multiply and inner-product accumulation. Recoverable comments: put the low-order bit of the multiplier into the sign bit; if the removed bit = 1 (test negative), add the multiplicand; shift and test the multiplier; if A was zero after the right shift, the multiply is done; shift the multiplicand, checking whether a bit shifted into the sign (multiplicand overflow) and checking for overflow on the add; no overflow of either kind, go on; if LINK is now 1, the result is to be negative, so fix up the sign by putting LINK into the sign bit; on product overflow set the global flag and skip the sum; otherwise add the product to the sum, setting the flag on sum overflow; increment the counter and test INDEX:LEN.]
FIG. B-4 (continued).
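The comments on this page describe a classical software shift-and-add multiply with overflow detection on both the multiplicand shift and the accumulating product. The C sketch below shows the same scheme in outline; it is an illustration under assumed 32-bit two's-complement words, not a transcription of the Z-machine routine (which folds the sign into the LINK bit instead of a C variable).

```c
#include <stdint.h>

/* Shift-and-add multiply, as in the listing: examine the low-order
   bit of the multiplier, add the multiplicand when it is 1, then
   shift multiplier right and multiplicand left until the multiplier
   is exhausted.  The overflow flag mimics the listing's global-flag
   convention; the word width is an assumption. */
static int32_t shift_add_multiply(int32_t a, int32_t b, int *overflow)
{
    /* Work with magnitudes; the sign is fixed up at the end, the way
       the listing settles the sign via LINK before the loop. */
    int negative = (a < 0) != (b < 0);
    uint32_t multiplier   = (a < 0) ? -(uint32_t)a : (uint32_t)a;
    uint32_t multiplicand = (b < 0) ? -(uint32_t)b : (uint32_t)b;
    uint32_t product = 0;

    *overflow = 0;
    while (multiplier != 0) {
        if (multiplier & 1) {            /* removed bit = 1: add multiplicand */
            if (product > UINT32_MAX - multiplicand)
                *overflow = 1;           /* overflow on the add */
            product += multiplicand;
        }
        multiplier >>= 1;                /* shift and test the multiplier */
        if (multiplier != 0) {
            if (multiplicand & 0x80000000u)
                *overflow = 1;           /* bit shifted into sign: overflow */
            multiplicand <<= 1;          /* shift the multiplicand */
        }
    }
    if (product > 0x7FFFFFFFu)
        *overflow = 1;                   /* product outside signed range */
    return negative ? -(int32_t)product : (int32_t)product;
}
```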
[Assembler listing: advancing to the next pair of operands. Recoverable comments: when INDEX = LEN, that's all, go to NEXT; AADR <- AADR+1 and fetch the next element from AADR; BADR <- BADR+LEN and fetch the next from BADR; if a slave exists, go on; if not, check FLAG: FLAG = 0 means no action required; FLAG != 0 means try again to get a slave and, if one is got, send the startup data beginning with the stack pointer.]
FIG. B-4 (continued).
[Assembler listing: restart of a processor on its next stack block. Recoverable comments: load J from JBASE, AADR from ABASE, and BADR from BBASE; go do the next element (branch to MATRIX). The listing ends with %RUN, %SNAP, and %STOP text directives and END.]
FIG. B-4 (continued).
A=
-176 0 10 97 -169 -68
18 104 -162 -61 0 111
0 -54 32 118 -147 -46
39 0 -140 -39 46 133
-133 17 -171 -134 37 -4
136 162 -1 129 0 -5
B=
-126 -32 54 140 -118 -25
61 147 -111 0 68 154
-104 -10 75 162 0 -3
0 169 -90 3 90 176
-82 68 -92 -15 112 -91
18 9 101 157 65 147
FIG. B-5. Example 3 - matrix data values.
Flowchart of Example 1: Sifting Sort
I contains the initial table address.
F contains the final table address.
R = 1 when fetching from memory, = 0 when receiving data from master.
S = 1 when data are to be stored into memory (due to unavailability of a slave). After a slave has been attached, S = 0 when slave is ready to receive the next item, = -1 otherwise.
[Flowchart graphics not reproducible. Recoverable steps: slave start; A <- input from master; output '0' token to master.]
FIG. B-6. Example 1 - detailed flowchart.
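Reading the register conventions together with the flowchart, each processor in the chain retains one item and passes every other item on to its slave, so that after all items have streamed through, the chain holds the table in order. The C sketch below is a serial rendering of that sifting idea under the assumption that each stage keeps the smallest item it has seen; it is illustrative, not the Z-machine program.

```c
#include <stdio.h>

/* Sequential rendering of the chained sifting sort of Example 1:
   "processor" i holds the smallest item it has seen and passes the
   rest downstream, so after the stream ends it holds the (i+1)-th
   smallest element. */
static void sifting_sort(int table[], int n)
{
    for (int i = 0; i < n; i++) {          /* processor i in the chain  */
        for (int j = i + 1; j < n; j++) {  /* items streamed through it */
            if (table[j] < table[i]) {     /* keep the smaller item ... */
                int pass_on = table[i];    /* ... pass the larger on    */
                table[i] = table[j];
                table[j] = pass_on;
            }
        }
    }
}

int main(void)
{
    int t[] = { 37, -4, 129, 0, -68, 17 };
    sifting_sort(t, 6);
    for (int i = 0; i < 6; i++)
        printf("%d ", t[i]);
    printf("\n");
    return 0;
}
```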
[Flowchart graphics not reproducible. Recoverable steps: last item; token not yet received; store B at MEMORY[...]; output B to slave; S <- -1; slave start; S <- 1; output S to slave (S becomes the slave's R); output I + 1; output F.]
FIG. B-6 (continued).
[Flowchart graphics not reproducible. Recoverable steps: store B at MEMORY[...]; output B to slave; store A at MEMORY[I]; this processor is done; last processor; this processor restarts.]
FIG. B-6 (continued).
Flowchart of Example 2: Shell Sort
Initially (the increment relation on the first line is illegible in this copy):
B = address (TABLE (R))
initial K = address (TABLE (R + I))
initial L = K - I
TN = addr (TABLE (N))
[Flowchart graphics not reproducible; the chart proceeds to a FORK.]
FIG. B-7. Example 2 - detailed flowchart.
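For reference, the passes sketched in this flowchart follow the usual Shell sort pattern: compare items an increment I apart, ripple an out-of-order item back through positions I apart, and shrink I between passes. The serial C sketch below shows that pattern; the halving increment schedule is an assumption (the listing's comment suggests I is roughly halved each pass), and the local names echo the flowchart's X, K, and L rather than thesis code.

```c
/* Serial sketch of the Shell sort of Example 2.  Each pass compares
   TABLE(R) with TABLE(R+I) and ripples an out-of-order item back
   through earlier positions I apart; I shrinks between passes. */
static void shell_sort(int table[], int n)
{
    for (int gap = n / 2; gap > 0; gap /= 2) {   /* increment I (assumed schedule) */
        for (int k = gap; k < n; k++) {          /* K walks the table     */
            int x = table[k];                    /* X <- (K)              */
            int l = k - gap;                     /* L <- K - I            */
            while (l >= 0 && table[l] > x) {     /* ripple back by I      */
                table[l + gap] = table[l];
                l -= gap;
            }
            table[l + gap] = x;                  /* insert X in its place */
        }
    }
}
```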
[Flowchart graphics not reproducible. Recoverable steps: STK <- STK + STKDEP; ... addr(TABLE(R+I)); X <- (K); Y contains TABLE(R).]
FIG. B-7 (continued).
[Flowchart graphics not reproducible. Recoverable steps: K <- K + 1; L <- K; X <- (K); ripple Y; last item; go to next sequential item.]
FIG. B-7 (continued).
[Flowchart graphics not reproducible. Recoverable steps: (B + I) -> Z; Z <- X; (B + 1) -> X; Y <- Z (already have); try to get a slave; become next processor.]
FIG. B-7 (continued).
Flowchart of Example 3: Matrix Multiply
Assume registers STK, LEN, AADR, BADR, CADR are initialized.
Other registers: I = PROD, J = INDEX, A = AADR, B = BADR.
Stack layout:
STK initial -> STKBASE -> FLAG, IBASE, JBASE, ABASE, BBASE
STK (1) -> AADR (1), BADR (1), STKBASE
STK (2) -> AADR (2), BADR (2), STKBASE
[Flowchart graphics not reproducible. Recoverable steps: start: I <- 1, J <- 1, FLAG <- 0; one branch sets ABASE <- AADR + LEN, BBASE <- BADR - LEN + 1, FLAG <- 0, IBASE <- I + 1, JBASE <- 1; the other sets ABASE <- AADR, BBASE <- BADR + 1, FLAG <- 0, IBASE <- I, JBASE <- J + 1; FORK; if no slave, FLAG <- 1; output STK + 3, CADR + 1, LEN.]
FIG. B-8. Example 3 - detailed flowchart.
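The flowchart's unit of work is one inner product: a forked processor takes AADR (the start of a row of A) and BADR (the start of a column of B) from its stack block, clears (CADR), and accumulates A*B terms, stepping AADR by 1 and BADR by LEN. The C sketch below shows that decomposition serially; the two loops stand in for the FORK chain, and the names are taken from the flowchart rather than from runnable thesis code.

```c
/* One unit of work in Example 3: accumulate a single element of C
   from a row of A and a column of B.  A's pointer steps by 1 and
   B's by LEN, matching the flowchart's AADR+1 / BADR+LEN updates.
   Matrices are LEN x LEN, stored row-major. */
static void inner_product(const int *a_row, const int *b_col,
                          int *c_elem, int len)
{
    int sum = 0;                            /* (CADR) <- 0 */
    for (int k = 0; k < len; k++)
        sum += a_row[k] * b_col[k * len];   /* A steps by 1, B by LEN */
    *c_elem = sum;
}

static void matrix_multiply(const int *a, const int *b, int *c, int len)
{
    /* The startup code forks one processor per element; here the two
       loops stand in for the FORK chain. */
    for (int i = 0; i < len; i++)
        for (int j = 0; j < len; j++)
            inner_product(&a[i * len], &b[j], &c[i * len + j], len);
}
```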
[Flowchart graphics not reproducible. Recoverable steps: (STK) -> AADR, (STK + 1) -> BADR; (CADR) <- 0; A <- (AADR), B <- (BADR); output STK + 3, CADR + 1, LEN; multiply and sum: (CADR) <- (CADR) + A * B; INDEX <- INDEX + 1; FORK; test FLAG = 0 or != 0; slave?]
FIG. B-8 (continued).
APPENDIX C
List of Z-Machine Operation Codes
Control group

(hex) mnem  description
00    STOP  deallocate self and wait for startup
01    WIN   wait for input
02    FORK  request allocation of slave
03    OSLF  output data to slave, with flag
04    DET   detach self from master (if any) and slave (if any)
05    OMAF  output data to master, with flag
06    OMA   output data to master
07    OSL   output data to slave
08    OMEM  output data to memory
09    MEMR  start memory cycle - destructive ("half") read
0A    MEMF  start memory cycle - full read ("fetch") and rewrite
0B    MEMS  start memory cycle - store
0C    MEMM  start memory cycle - read and lock ("read-modify-write")

Miscellaneous group

(hex) mnem  description
10    TEST  set condition code according to contents of A
11    INCA  add 1 to A
12    DECA  subtract 1 from A
13    (mnemonic illegible)  complement A
14    ALNK  add LINK to A
15    LNK1  set LINK to 1
16    LNK0  set LINK to 0
17    (mnemonic illegible)  invert LINK
18    AOV   add OVERFLOW to A
19    TMA   test if MASTER = 0 (i.e., executing at top level)
1A    TSL   test if SLAVE = 0 (i.e., no slave allocated)
1B    TIF   test for flag on last input
1C    TIN   test input source ("0" = MEMORY, "< 0" = SLAVE, "> 0" = MASTER)
(entries 1D-1F are not legible in this copy)

Skip group

(lower 2 bits of instruction = n - 1; assembler version: SKm XX, where XX is a two-character condition specification and m = 1, 2, 3, or 4)

(hex) mnem  description
20    SKOV  skip n on OVERFLOW = 1
24    SKGT  skip n on A > 0
28    SKLT  skip n on A < 0
2C    SKEQ  skip n on A = 0
30    SKNO  skip n on OVERFLOW = 0
34    SKLE  skip n on A <= 0
38    SKGE  skip n on A >= 0
3C    SKNE  skip n on A != 0

Register group

(r = 0, ..., 7; register 0 is the program counter)

(hex) mnem  description
40    AR    add register r to A
48    LR    load A from register r
50    INCR  add 1 to register r
58    EXR   exchange A with register r
60    SR    subtract register r from A
68    NR    AND register r into A
70    OR    OR register r into A
78    STR   store A into register r

Shift group

(lower 2 bits give shift amount: 0 = 1 bit, 1 = 3 bits, 2 = 8 bits, 3 = (register 1))

(hex) mnem   description
80    SHLLN  shift left logical, no link
84    SHLAN  shift left arithmetic, no link
88    SHLCN  shift left circular, no link
8C    SHLCL  shift left circular, link
90    SHRLN  shift right logical, no link
94    SHRAN  shift right arithmetic, no link
98    SHRCN  shift right circular, no link
9C    SHRCL  shift right circular, link

Immediate instructions

(hex) mnem  description
C0    LI i  load immediate (lower 6 bits of instruction are loaded into A)
A0    AI i  append immediate (lower 5 bits of instruction are shifted, left logical no link, into A from the right)
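The field conventions in this table - a 2-bit skip count, a 3-bit register number, and 6- and 5-bit immediates - can be summarized as a decoder. The C sketch below is illustrative only; it assumes an 8-bit operation code laid out exactly as tabulated, which the table implies but does not state outright.

```c
#include <stdint.h>

/* Illustrative field extraction for the opcode layout tabulated
   above.  Everything beyond the table itself (exact field positions,
   the 8-bit width) is an assumption for illustration. */
typedef struct {
    uint8_t op;   /* one Z-machine operation code */
} zinstr;

static int skip_count(zinstr z)   { return (z.op & 0x03) + 1; } /* skip group: n  */
static int reg_number(zinstr z)   { return z.op & 0x07; }       /* register group */
static int li_immediate(zinstr z) { return z.op & 0x3F; }       /* LI: lower 6 bits */
static int ai_immediate(zinstr z) { return z.op & 0x1F; }       /* AI: lower 5 bits */

/* Append-immediate, per the AI description: the 5 immediate bits are
   shifted into A from the right (left logical, no link). */
static uint32_t append_immediate(uint32_t a, zinstr z)
{
    return (a << 5) | (uint32_t)ai_immediate(z);
}
```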
APPENDIX D
Result Data

Example 1 (only the leading columns - processors, execution time, and time ratio - survived legibly; the allocation, parallelism, MER, per cent busy, per cent wait, per cent bus wait, and bus bandwidth columns did not stay aligned in this copy):

  (config illegible):  2 13,423 0.552;  4 7,412 0.305;  8 4,374 0.180;  15 3,094 0.121
  (config a 2.15):     1 24,413 1.000;  2 13,182 0.538;  4 1,256 0.296;  8 4,257 0.114;  15 3,023 0.123
  (config 4 10):       1 28,408 1.000;  2 14,428 0.501;  4 8,856 0.312;  8 6,148 0.216;  15 5,652 0.199
  (config 3 7):        1 21,354 1.000;  (remaining rows illegible)

Example 2 (full rows survived; the column headings did not):

  (config 4 10)  1 38,716 1.000  38,112 21,682  1.00 0.72  68 29  3 58
                 2 26,888 0.695  45,064 27,762  1.68 1.03  58 34  8 86
                 4 26,548 0.685  68,668 21,810  2.59 1.05  38 41 21 88
                 8 26,526 0.685  92,592 21,866  3.49 1.05  28 44 28 88
  (config 1 4)   1 32,855 1.000  32,854 21,662  1.00 0.84  84 16  0 11
                 2 20,403 0.620  33,614 27,162  1.65 1.36  83 17  0 28
                 4 15,543 0.413  33,991 27,810  2.19 1.79  82 18  0 31
                 8 13,986 0.425  31,884 27,866  2.71 1.99  74 23  3 42
  (config 1 2)   1 30,913 1.000  30,913 27,682  1.00 0.89  92  8  0  9
                 2 19,051 0.615  31,349 27,762  1.65 1.46  92  8  0 15
                 3 18,061 0.583  31,154 27,930  1.16 1.55  91  9  0 16
                 4 13,622 0.440  29,160 21,810  2.18 2.04  93  1  0 21
                 8 11,614 0.316  30,416 27,866  2.62 2.40  91  8  0 25

Example 3 (legible columns - processors, execution time, time ratio):

  (config 1 2.125):  1 38,154 1.000;  2 20,736 0.544;  4 10,718 0.281;  8 6,316 0.166;  12 4,927 0.129
  (config 4 10):     1 48,156 1.000;  2 28,043 0.583;  4 11,460 0.363;  7 15,740 0.321
  (config 1 1):      1 46,261 1.000;  2 24,512 0.530;  4 12,626 0.273;  8 1,245 0.157;  11 5,916 0.128

[The scattered remaining columns for Examples 1 and 3 - total time, allocation, parallelism, MER, per cent busy, per cent wait, per cent bus wait, and bus bandwidth - could not be realigned with their rows.]