
UNIVERSITY OF CALIFORNIA

Los Angeles

AMPIRE:

Asynchronous Microprocessor with Instruction Retry

A thesis submitted in partial satisfaction of the

requirements for the degree Master of Science

in Computer Science

by

Chia-Chi Chao

1995

© Copyright by

Chia-Chi Chao

1995

The thesis of Chia-Chi Chao is approved.

Milos D. Ercegovac

David A. Rennels

Yuval Tamir, Committee Chair

University of California, Los Angeles

1995


Table of Contents

Chapter One: Introduction
    1.1. Asynchronous Design
    1.2. Fault Tolerance
    1.3. Scope of Thesis

Chapter Two: Previous Work
    2.1. Handshake Protocols
    2.2. Asynchronous Processors and Circuits
    2.3. Micro Rollback

Chapter Three: Processor Architecture
    3.1. Instruction Set
    3.2. Processor Overview
    3.3. Normal Operation
        3.3.1. Instruction Fetch
        3.3.2. Instruction Issue
        3.3.3. Instruction Execution
            3.3.3.1. ALU
            3.3.3.2. Data Memory Load/Store
            3.3.3.3. Flow Control
        3.3.4. Register File Access
    3.4. Fault Tolerance
        3.4.1. Sequence Number and Check Vector
        3.4.2. Error Detection
        3.4.3. Instruction Log and Validation
        3.4.4. Rollback
        3.4.5. Delayed Write Buffer
    3.5. Synchronization
        3.5.1. Module Level
        3.5.2. Processor Level

Chapter Four: Behavioral Modeling
    4.1. Building Blocks
        4.1.1. Tri-State Bus
        4.1.2. Muller C-Element
        4.1.3. Buffer
        4.1.4. Reset
        4.1.5. Arbiter
        4.1.6. Rollback
    4.2. Processor Modules
        4.2.1. Queues
        4.2.2. Checkers
        4.2.3. Memories
        4.2.4. Program Counter
        4.2.5. Arithmetic Logic Unit
        4.2.6. Controller
            4.2.6.1. Instruction Issuing Unit
            4.2.6.2. Reservation Table
        4.2.7. Delayed Write Buffers
            4.2.7.1. REGDWB
            4.2.7.2. MEMDWB
        4.2.8. Instruction Log
    4.3. Putting the Processor Together
        4.3.1. Parameters and Timing
        4.3.2. Top-Level Wiring and Testing Module

Chapter Five: Behavioral Simulation
    5.1. Instruction Fetch/Issue and Queue Operations
    5.2. Delayed Write Buffers and Checker Arbitration
    5.3. Arbitration for R_bus and Out-of-Order Completion
    5.4. Fault Detection and Rollback
    5.5. Running a Real Program

Chapter Six: Hardware Design and Gate-Level Simulation
    6.1. C-Elements
    6.2. Arbiter
    6.3. Register Completion Detector
    6.4. Instruction Queue
    6.5. Data Queue
        6.5.1. Sequence Number Comparator
        6.5.2. DQ Control
        6.5.3. DQ Simulation
    6.6. Fault Simulation with Gate-Level Modules

Chapter Seven: Conclusion

Appendix A: Verilog Simulation Code

Appendix B: AMPIRE Assembler
    B.1. Assembly Code Format
    B.2. Assembler Source Code

Bibliography

List of Figures

2.1. 2-Phase Handshake Protocol
2.2. 4-Phase Handshake Protocol
2.3. Delay-Insensitive 4-Phase Handshake Protocol
2.4. Request Generation for Double-Rail Data
2.5. MIPS as a Function of Supply Voltage for Caltech Processor
2.6. Handshake Model for Berkeley DSP
2.7. Micro Rollback, Restoring a Saved Snapshot
2.8. Register File with Support for Micro Rollback
3.1. AMPIRE Instruction Format
3.2. AMPIRE Block Diagram
3.3. Sequence Number Windows
3.4. Check Vector for ALU Operation
3.5. Validation and Rollback Signal Flow Diagram
3.6. Asynchronous Delayed Write Buffer for Register File
3.7. Halt and Rollback Sequence
4.1. Data Transfer Sequence
5.1. Test 1 with Slow IMEM
5.2. Test 1 with Fast IMEM
5.3. Test 1 with Fast IMEM, Close-Up
5.4. Test 1, Detailed Activities
5.5. Test 2 with All Delays=1
5.6. Test 2 with Long DMEMWrite Cycle
5.7. Test 2 with Slow Memory Checker
5.8. Test 2, Showing Arbitration of Checkers
5.9. Test 3 with Slow DMEM
5.10. Test 3 with Fast DMEM
5.11. Test 4, All Faults
5.12. Test 4, CKI Fault
5.13. Test 4, CKF Fault
5.14. Test 4, ALU CKR Fault
5.15. Test 4, CKM Fault
5.16. Test 4, DMEM CKR Fault
5.17. Test 4, Rollback Sequence
5.18. Test 5, Find the Largest Number
6.1. 3-Input C-Element
6.2. 14-Input C-Element
6.3. 2-Input C-Element with Clear
6.4. Arbiter for R_bus
6.5. Gate-Level Arbiter
6.6. D-Latch
6.7. D-Latch with Set/Reset
6.8. Register Completion Detector
6.9. Control for IQ Buffer
6.10. Gate-Level IQ
6.11. Sequence Number Comparator
6.12. Borrow Circuits, Full and Half
6.13. XOR Gate
6.14. Difference (SUB) Circuit
6.15. Control for DQ Buffer
6.16. Test 6, Gate-Level DQ with DLY_CKI_CHK=48
6.17. Test 6, Gate-Level DQ with DLY_CKI_CHK=52
6.18. Test 7, Gate-Level DQ with DLY_DMEM_RD=8
6.19. Test 7, Gate-Level DQ with DLY_DMEM_RD=1
6.20. Fault Simulation with Gate-Level Modules

List of Tables

2.1. Average Instruction Periods for Berkeley DSP
3.1. AMPIRE Instruction Set
3.2. Processor Modules
4.1. Frequently Used Verilog Keywords
4.2. DWB Bits in the Instruction Log
A.1. Verilog Modules

ABSTRACT OF THE THESIS

AMPIRE:

Asynchronous Microprocessor with Instruction Retry

by

Chia-Chi Chao

Master of Science in Computer Science

University of California, Los Angeles, 1995

Professor Yuval Tamir, Chair

As we build faster digital circuits, clock skew becomes a major limiting factor in large-scale synchronous systems. Asynchronous design has the advantage of not requiring a global clock, and the high modularity it can achieve is especially beneficial for large systems. Reliability is an important issue in many applications, and hardware fault-tolerance techniques can be applied for fast recovery from transient faults.

This thesis reports the design of an asynchronous microprocessor that supports concurrent error detection and software-transparent fault recovery. On-chip parity checkers operate in parallel with other functional units, and when an error is detected, the state of the processor is restored to the instruction in which the error occurred. A Verilog behavioral model of the architecture has been written and simulated, and some gate-level self-timed circuits have also been designed.


Chapter One: Introduction

1.1. Asynchronous Design

Because of advances in semiconductor technology, VLSI circuit density and speed have increased dramatically over the years. In a synchronous design, one of the limitations on system speed is clock skew, which is the phase difference between the clock signals at different locations in the system. As the feature size decreases, wire delays and loading become proportionally more significant, and the clock skew problem becomes worse. Clock skew can be minimized with careful clock distribution, but clock balancing cannot be done until the chip layout is complete and the loads are characterized. More robust circuits can be designed to be less sensitive to clock skew, but usually at the expense of speed and chip area. Furthermore, the clock period must accommodate the slowest critical path in the system, even though some stages may be able to operate at a much faster rate.

Instead of relying on a global clock for synchronization, each block in an asynchronous system decides when it is ready to accept a new set of data to be processed. Various handshake protocols may be used to coordinate the communication, and they will be discussed in the next chapter. A properly designed self-timed system can be made very modular. For example, if an application requires a faster adder, a ripple-carry adder may be replaced directly with a carry-lookahead adder. Since the circuit is self-timed, the rest of the system does not need to be modified; it simply runs faster when add operations are performed. Conversely, any module may be replaced with a slower version to save power, and there is no clock to adjust to compensate for the longer critical path.

The following examples illustrate some advantages of asynchronous systems. The torus routing chip developed at Caltech [Dall86] is a self-timed circuit for multiprocessor interconnection. Due to an oversight in the critical path, the first silicon could run at only 4 MHz rather than the expected 20 MHz. However, the chip still functioned correctly because of its self-timed design. In the DEC PDP-6 computer, an asynchronous adder was used to take advantage of its higher average performance [Bell78]. It was observed by von Neumann and others that, for random operands, the average length of the longest carry-propagation chain in an n-bit addition is about log2(n). Therefore, for the 36-bit word size of the PDP-6, average add performance was increased by a factor of about 7, because the longest carry chain is 5.2 bits on average (log2 36 ≈ 5.17) rather than the worst case of 36. In a more recent RISC design, the 100 MHz HP PA-RISC CPU employs a self-timed floating-point coprocessor for speed and area reasons [Wils92]. Its multiplier can compute a double-precision result in 20 ns and occupies only 0.07 cm² of chip area.
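The carry-chain argument can be checked numerically. The Python sketch below is a hypothetical illustration, not taken from [Bell78] or from this thesis. It classifies each bit position as generate, propagate, or kill, and it measures the longest distance a carry ripples for random operands. The counting convention is one choice among several: a chain counts the generating bit plus every propagating bit it crosses. For that reason the simulated average should only be read as falling in the same neighborhood as log2(36) ≈ 5.2, far below the worst case of 36.

```python
import math
import random


def longest_carry_chain(a: int, b: int, width: int) -> int:
    """Longest carry-propagation chain when adding a + b (hypothetical helper).

    A carry is generated at bit i when a_i = b_i = 1. It ripples through each
    following bit where exactly one operand bit is 1 (propagate), and it stops
    at a bit where both are 0 (kill) or where a new carry is generated.
    """
    longest = 0
    run = 0  # bit positions the current carry has rippled through (0 = no carry)
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        if ai & bi:              # generate: a new chain starts here
            run = 1
        elif (ai ^ bi) and run:  # propagate: an incoming carry ripples on
            run += 1
        else:                    # kill, or propagate with no incoming carry
            run = 0
        longest = max(longest, run)
    return longest


def average_longest_chain(width: int, trials: int, seed: int = 1) -> float:
    """Monte Carlo estimate over uniformly random operands."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = rng.getrandbits(width)
        b = rng.getrandbits(width)
        total += longest_carry_chain(a, b, width)
    return total / trials


if __name__ == "__main__":
    width = 36  # PDP-6 word size
    avg = average_longest_chain(width, trials=20000)
    print(f"worst case = {width}, log2({width}) = {math.log2(width):.2f}, "
          f"simulated average = {avg:.2f}")
```

A self-timed adder with completion detection finishes as soon as the actual chain has settled. It therefore pays for the typical chain length instead of the worst case that a clocked adder must budget for.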

However, without a global clock, additional handshake circuits must be added to handle inter-block communication, and self-timed circuits themselves are, in general, more complex. These timing and area overheads for handshaking must be considered when choosing between a synchronous and an asynchronous design. Also, designing circuits for proper handshake sequencing is not a trivial task. Several approaches have been proposed for synthesizing such hardware from high-level specifications [Burn87, Meng89, Moln85].

The difficulty with metastability may seem to be a major drawback for asynchronous systems. However, as pointed out in [Meng89], asynchronous design does not introduce additional undecidable timing problems. To latch data correctly, the clock in a synchronous system is made slow enough that data arrives before the clock edge. In a properly designed, fully asynchronous system, the handshake methodology guarantees that the latching signal arrives after the data. An arbiter, on the other hand, is inherently susceptible to metastability because more than one request may become active simultaneously [Meng89, Seit80]. Asynchronous systems may depend more heavily on arbiters because of their unpredictable timing. However, if fair mutual exclusion is to be incorporated in a synchronous design, circuits that can become metastable are usually necessary there as well.
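A behavioral model cannot reproduce metastability itself, but it can capture the contract an arbiter must satisfy. The Python sketch below is a hypothetical two-input mutual-exclusion element, not the arbiter designed later in this thesis. It never grants both requests. When both requests arrive together, it picks an arbitrary winner, and the random choice stands in for however the real circuit eventually resolves.

```python
import random
from typing import Optional


class MutexElement:
    """Behavioral two-input mutual-exclusion element (hypothetical model).

    At most one grant is active at any time. When both requests are pending
    and no grant is held, the winner is chosen arbitrarily; this abstracts
    away the unbounded metastability-resolution time of a real arbiter.
    """

    def __init__(self, seed: int = 0) -> None:
        self.grant: Optional[int] = None  # index of the granted requester
        self._rng = random.Random(seed)

    def step(self, req0: bool, req1: bool) -> Optional[int]:
        reqs = (req0, req1)
        # A requester that withdraws its request releases the grant.
        if self.grant is not None and not reqs[self.grant]:
            self.grant = None
        if self.grant is None:
            if req0 and req1:
                self.grant = self._rng.choice((0, 1))  # tie: arbitrary winner
            elif req0:
                self.grant = 0
            elif req1:
                self.grant = 1
        return self.grant
```

Once a grant is held, the element keeps it until that requester drops its request, so the loser simply waits. Fairness in this sketch comes only from the arbitrary tie-break. A real design might need a stronger policy.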

1.2. Fault Tolerance

In applications where reliability is a major concern, fault-tolerance features should be designed into a system to minimize down time. Since transient faults occur much more frequently than permanent faults [Cast82], fast recovery from transient faults is key to improving system performance. In environments where a high rate of transient faults is expected (due to radiation, noise, etc.), software-driven correction may be too slow to be practical, and hardware fault-tolerance features must be added to assist recovery. The RH32 radiation-hard, 5-chip computer under development by TRW, McDonnell Douglas, and United Technologies is an example of a fault-tolerant processor [TRW92]. Various degrees of fault tolerance may be achieved by running the CPU stand-alone, with other chips in the family, or as a processor/checker pair with two CPUs. Many faults can be detected and corrected in under 2 µs (with a 25 MHz clock) without any software intervention.

To achieve a high degree of fault tolerance, it is often necessary to detect errors as soon as they occur so that corrupted data does not spread throughout the system. One way is to check the data in series, delaying execution until verification is complete. However, this reduces performance because either the cycle time or the number of pipeline stages must be increased. To minimize performance degradation, checkers can instead operate in parallel with the functional units. The next pipeline stage can start execution while data verification is carried out, and the checking result becomes available some time later. This largely solves the problem of delay through the checker, but fault recovery becomes more complicated because all computations performed on the corrupted data must also be discarded. Micro rollback is a technique that uses concurrent verification: the system is rolled back several cycles in response to a delayed error signal [Tami88, Tami90a]. Micro rollback will be briefly described in the next chapter; for a complete discussion, see the cited references.

1.3. Scope of Thesis

As the title of this thesis suggests, AMPIRE is an asynchronous fault-tolerant microprocessor. The DLX architecture [Henn90] is the basis for the design because it is well known, and its load-store RISC architecture is easy to implement in an academic environment. Asynchronous processors have been built and tested [Jaco90, Mart89], and a fault-tolerant processor capable of micro rollback has been designed and implemented [Tami90b]. Therefore, the original goal of this work was to design an asynchronous processor that supports micro rollback. Due to unexpected complexities that will be discussed in Chapter 7, AMPIRE rolls back only to instruction boundaries, not to the sub-instruction level as in micro rollback. However, many of its fault-tolerance features are still based on the research results of micro rollback.

In the next chapter, previously published work related to self-timed design and micro rollback will be reviewed. The AMPIRE architecture and some design decisions will be discussed in Chapter 3. To verify that the design is logically correct, a Verilog model of the processor has been written and simulated. The behavioral model will be described in Chapter 4, and simulation results will be shown in Chapter 5. Since this processor has not been physically implemented, circuit diagrams will be presented in Chapter 6 to justify some high-level constructs used in the behavioral model. Chapter 7 will then discuss some issues found in the design process and the future work that could extend this research. The full Verilog code and the source code for the AMPIRE assembler are listed in the appendices.


Chapter Two: Previous Work

2.1. Handshake Protocols

[Timing diagram: data, req, ack]

Figure 2.1: 2-Phase Handshake Protocol

[Timing diagram: data, req, ack]

Figure 2.2: 4-Phase Handshake Protocol

Whereas a synchronous system initiates each task cycle with a clock, an asynchronous system uses handshake signals to start and stop an activity. The 2-phase and 4-phase self-timed signaling conventions of [Seit80] are presented in slightly modified form in Figures 2.1 and 2.2. In both cases, the request is sent after the data becomes stable. After the receiver finishes processing the data, an acknowledgement is sent back, and then the data is released. The 4-phase handshake is level-sensitive, and it may be slower because an extra round trip is required to return the req and ack signals to their inactive levels. However, the 2-phase handshake needs edge detectors, which result in more complicated circuitry. Therefore, 4-phase signaling is generally used for local communication, leaving 2-phase signaling for long-distance transactions. In terms of power, 2-phase signaling has the advantage of using half as many signal transitions, and the energy savings may be significant if the interconnection wires are long.
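The transition counts can be made explicit with a small Python sketch that lists the req/ack events for a sequence of transfers under each convention. The function names and event representation are hypothetical, and data wires are omitted. It shows the 4 transitions per transfer of the return-to-zero (4-phase) scheme against 2 for the transition (2-phase) scheme.

```python
from typing import List, Tuple

Event = Tuple[str, int]  # (signal name, new level)


def four_phase(n_transfers: int) -> List[Event]:
    """Return-to-zero signaling: req and ack each rise and fall per transfer."""
    events: List[Event] = []
    for _ in range(n_transfers):
        events += [("req", 1), ("ack", 1), ("req", 0), ("ack", 0)]
    return events


def two_phase(n_transfers: int) -> List[Event]:
    """Transition signaling: every edge on req (or ack) is one event,
    so the wires do not return to zero between transfers."""
    events: List[Event] = []
    req = ack = 0
    for _ in range(n_transfers):
        req ^= 1
        events.append(("req", req))
        ack ^= 1
        events.append(("ack", ack))
    return events
```

Two 2-phase transfers produce exactly the same waveform as one 4-phase transfer. This is why a 2-phase receiver must respond to edges rather than levels, and why it needs the edge-detection circuitry mentioned above.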

Note that in Figures 2.1 and 2.2, data must be stable before the request signal can be activated. This is true at the transmitter, but if the data and handshake signals are routed differently, this ordering may not be preserved at the receiver, thereby violating the protocol. This type of handshake is valid only within an equipotential region, unless routing and delays can be carefully controlled. An equipotential region is an area small enough that the wire delay across it is small compared to signal rise and fall times [Seit80]. For communication outside an equipotential region, a delay-insensitive protocol should be used to maintain correct self-timed operation.

Delay-insensitive circuits are a subset of the class of self-timed circuits. One way to achieve delay insensitivity is double-rail encoding [Seit80], in which each bit is carried on a pair of wires (data, data-bar): 00 means undefined, 10 represents one, and 01 represents zero. The code 11 is not allowed, and only one of the two signals may switch at any time. A handshake protocol using this encoding is shown in Figure 2.3; the first cycle transfers a one, and the second cycle transfers a zero. If a request signal is needed at the receiver end, it can easily be generated by ORing the double-rail data lines. Figure 2.4 shows a circuit that uses a C-element to merge the multiple internal request signals. Muller C-elements are often used in self-timed systems and behave as follows: when all inputs are 1, the output becomes 1; when all inputs are 0, the output becomes 0; otherwise, the output retains its previous state. C-element implementations will be discussed in Section 6.1.
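The request-generation scheme of Figure 2.4 can be modeled behaviorally in a few lines of Python. The names below are hypothetical, and this is not the thesis's Verilog C-element. Each bit's request is the OR of its two rails. A C-element over all the bit requests then raises the word-level request only when every bit is defined, and it lowers the request only after every bit has returned to the undefined code 00.

```python
from typing import Iterable, List, Optional, Tuple

Pair = Tuple[int, int]  # (data, data-bar) wires for one bit

EMPTY: Pair = (0, 0)  # undefined (spacer) code


def encode(bit: int) -> Pair:
    """Double-rail code as described in the text: 10 = one, 01 = zero."""
    return (1, 0) if bit else (0, 1)


def decode(pair: Pair) -> Optional[int]:
    if pair == (1, 0):
        return 1
    if pair == (0, 1):
        return 0
    if pair == EMPTY:
        return None
    raise ValueError("illegal double-rail code (11 is not allowed)")


class CElement:
    """Muller C-element: the output follows the inputs when they all agree;
    otherwise it holds its previous value."""

    def __init__(self, initial: int = 0) -> None:
        self.out = initial

    def update(self, inputs: Iterable[int]) -> int:
        values = list(inputs)
        if all(values):
            self.out = 1
        elif not any(values):
            self.out = 0
        return self.out


def bit_requests(word: List[Pair]) -> List[int]:
    """Per-bit request: OR of the two rails (1 once the bit is defined)."""
    return [d | dbar for d, dbar in word]
```

For a 3-bit word, the word-level request stays at 0 while any bit is still undefined. It rises once all three bits carry valid codes, and it holds at 1 during a partial reset until the last bit returns to 00.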

Delay-insensitive circuits remove a timing constraint, but they also carry a high price tag. With single-rail data plus request and acknowledge signals, a data bus of any width needs only two additional wires for handshaking. The double-rail method, while more reliable, requires twice as many signal lines. Therefore, the trade-off depends on how much control the designer has over signal delays and wire/block placement. Conversion between single-rail and double-rail signaling is quite straightforward, and a circuit can be found in [Seit80, p. 257].

[Timing diagram: data, data-bar, ack]

Figure 2.3: Delay-Insensitive 4-Phase Handshake Protocol

[Circuit: double-rail inputs in 1 and in 2 produce req 1 and req 2, which a C-element merges into req]

Figure 2.4: Request Generation for Double-Rail Data

2.2. Asynchronous Processors and Circuits

The asynchronous processor developed at Caltech is entirely delay-insensitive, with the exception of isochronic forks [Mart89]. An isochronic fork distributes a signal to several receivers, and the differences in delay among its branches are assumed to be small compared to gate delays, as in an equipotential region. The processor is a general-purpose 16-bit microprocessor with a load-store architecture and separate instruction and data memories, and it uses double-rail encoding with 4-phase handshaking as its communication protocol. The performance profile of the 2 µm version is shown in Figure 2.5. At room temperature, the chip is functional with a supply voltage as low as 0.35 V, and its speed reaches 30 MIPS at 12 V when the chip is submerged in liquid nitrogen. All these performance variations occur without any clock adjustment, since there is no clock.

[Plot: MIPS (0 to 30) versus supply voltage in volts (0 to 12), with curves at 300°K and 77°K]

Figure 2.5: MIPS as a Function of Supply Voltage for Caltech Processor

Another fully asynchronous chip is a digital signal processor (DSP) designed at UC Berkeley [Jaco90]. A ripple-carry adder and an iterative multiplier are used to save chip area and to take advantage of the self-timed circuits. Since an instruction cannot use both the shifter and the multiplier at the same time, these two units are placed in the same pipeline stage. Therefore, the instruction cycle time depends strongly on the instruction being executed and its data. Table 2.1 lists the average instruction cycle times at various supply voltages.

Vdd      Shift     Multiply
3.6 V    105 ns    440 ns
5.0 V    73 ns     337 ns
7.0 V    55 ns     260 ns

Table 2.1: Average Instruction Periods for Berkeley DSP

The DSP chip is self-timed, but not delay-insensitive like the Caltech processor. Single-rail data and request signals are used, as shown in the processor handshake model in Figure 2.6. After the request is received, the register is clocked to latch the data. Since the register provides no feedback about its completion, an assumption is made about the delay before the signal I (initialize) is raised to start an evaluation cycle.

[Diagram: interconnect circuit, register (Reg), and self-timed logic block, with signals data, req, ack, I, and DV]

Figure 2.6: Handshake Model for Berkeley DSP

After the self-timed logic block finishes, the data-valid signal DV is sent back to notify the interconnect circuit. A problem encountered in the initial layout resulted in a very long wire for the register clock. The additional delay was long enough to make the logic block start evaluation before the data bits had settled, but a change in the floorplan solved the problem. This case demonstrates that using the more efficient, delay-dependent circuits sacrifices some freedom in block placement.
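The failure in the initial layout can be stated as a simple timing inequality. In the hypothetical Python sketch below, the delay values are invented for illustration and are not taken from [Jaco90]. Evaluation is safe only if the fixed delay before I is raised covers two terms: the wire delay of the register clock, and the register's own latching delay. Lengthening the clock wire alone is enough to break the assumption.

```python
from dataclasses import dataclass


@dataclass
class BundledStage:
    """Hypothetical delays (ns) for one stage like the model in Figure 2.6,
    all measured from the moment req is received.

    clk_wire:      wire delay from the interconnect circuit to the register clock
    reg_delay:     clock-to-output delay of the register
    assumed_delay: fixed delay before I is raised (no feedback from the register)
    """
    clk_wire: float
    reg_delay: float
    assumed_delay: float

    def data_settled_at(self) -> float:
        # Register outputs are stable only after the clock edge has travelled
        # down its wire and the register has latched the data.
        return self.clk_wire + self.reg_delay

    def evaluate_starts_at(self) -> float:
        return self.assumed_delay

    def is_safe(self) -> bool:
        # The bundling assumption: evaluation must not start before data settles.
        return self.evaluate_starts_at() >= self.data_settled_at()
```

With a short clock wire (0.5 ns), a 2 ns register, and a 4 ns assumed delay, the stage is safe. Stretching only the clock wire to 3 ns makes evaluation start 1 ns before the data settles. In a delay-insensitive design, this inequality would be replaced by a completion signal from the register.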

As mentioned in Chapter 1, handshake circuits are difficult to design manually because all events have to occur in the correct sequence without, ironically, a clock. Both processors discussed in this section started with high-level signal descriptions, and the handshake/control circuits were then synthesized with CAD tools. Even in synchronous systems, the control blocks of most processors today are built through hardware synthesis. The various methods are too complex to be covered in this thesis; see [Burn87, Mart85, Mart86, Meng89, Moln85] for details.

An asynchronous system is not complete without self-timed memory. A self-timed static RAM is discussed in [Fran83]. The memory array uses conventional six-transistor static RAM cells, but circuitry is added to support the additional handshake requirements. Since the RAM cell and sense amplifier already have differential bit lines, similar to the double-rail encoding discussed earlier, generating the completion signal is a very natural extension. The external interface has additional request and acknowledge lines to be connected to other asynchronous devices, such as a processor. Only 5.2% of the total chip area is occupied by the self-timed completion detectors.

2.3. Micro Rollback

Micro rollback (in a synchronous system) works by taking snapshots of the state of a module. When an error is detected, a valid state is restored from the saved information. Figure 2.7 illustrates micro rollback [Tami90a]. An error is detected a few cycles after it occurs because the checkers operate in parallel with the functional units, which minimizes the performance degradation due to data verification. Micro rollback differs from instruction retry [Ciac81] because it is based on clock cycles rather than complete instructions. As a result, micro rollback can be implemented independently in each module that uses the clock to advance its state, regardless of the module's function or the pipeline structure.

[Timeline: a snapshot is saved in each of cycles 11 through 16; an error occurs, is detected in cycle 17, and micro rollback restores a saved snapshot]

Figure 2.7: Micro Rollback, Restoring a Saved Snapshot
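For a single register, the snapshot idea of Figure 2.7 can be sketched behaviorally in Python. This is a hypothetical illustration under simplifying assumptions, not the circuit of [Tami90a]. It assumes one state update per clock cycle and a fixed depth of saved snapshots, and the class and method names are invented.

```python
from collections import deque
from typing import Deque, Generic, TypeVar

S = TypeVar("S")


class RollbackRegister(Generic[S]):
    """Register that keeps its last `depth` states in a FIFO of snapshots,
    so it can be rolled back up to `depth` clock cycles (hypothetical model)."""

    def __init__(self, initial: S, depth: int) -> None:
        self.state = initial
        # deque(maxlen=...) discards the oldest snapshot when full.
        self._history: Deque[S] = deque(maxlen=depth)

    def clock(self, next_state: S) -> None:
        """One clock cycle: save a snapshot of the current state, then update."""
        self._history.append(self.state)
        self.state = next_state

    def rollback(self, cycles: int) -> None:
        """Restore the state from `cycles` clock cycles ago."""
        if not 1 <= cycles <= len(self._history):
            raise ValueError("cannot roll back that far")
        for _ in range(cycles):
            self.state = self._history.pop()
```

For example, with a depth of 3 and the values 10, 20, 30, 40 clocked in, a rollback of 2 cycles restores 20. The depth bounds how far back the module can go, so checker latency must stay within it.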

Storing the state of a simple register can be accomplished by adding a controller and connecting several register elements in a FIFO fashion. However, for a large register file, it is not practical to duplicate the entire block several times to allow rolling back multiple cycles. Since only one register (or a few registers, depending on the instruction set) can be written per clock cycle, a delayed write buffer (DWB) is used to hold the data destined for the register file until checking is complete, as shown in Figure 2.8 [Tami90a].

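Figure 2.8: Register File with Support for Micro Rollback (block diagram: the DWB, made up of a FIFO with valid bits v, a CAM holding register addresses, and a priority circuit, alongside the register file and its decoder; read data appears on Bus 1 and Bus 2)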

The content-addressable memory (CAM) holds the destination register addresses, and a valid bit indicates that the corresponding FIFO entry holds valid data to be written to the register file. If a clock cycle does not update the register file, the DWB is simply shifted to the left without setting the valid bit; when valid data is shifted out of the DWB, it is committed to the register file. The depth of the DWB therefore determines the maximum number of cycles that can be rolled back, and verification must be completed within that time. When an error is detected, the appropriate valid bits are cleared, and rollback is achieved because the state changes never reached the register file. The priority circuit retrieves the most recent value of a register even if it has not yet been written to the register file, so that instructions that depend on that value are not blocked from execution.
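The synchronous DWB behavior just described can be summarized in a cycle-level model. The sketch assumes at most one register write per cycle; the class, its methods, and the dictionary standing in for the register file are illustrative assumptions rather than the circuit of [Tami90a].

```python
from collections import deque
from typing import Dict, Optional, Tuple

class DelayedWriteBuffer:
    """Hypothetical cycle-level model of a synchronous DWB."""

    def __init__(self, depth: int, regfile: Dict[int, int]):
        self.depth = depth
        self.regfile = regfile
        # Each entry is [valid, reg, data]; index 0 is the oldest entry
        self.fifo = deque([[False, 0, 0] for _ in range(depth)])

    def clock(self, write: Optional[Tuple[int, int]] = None):
        """Shift one position per cycle; `write` is (reg, data) or None."""
        valid, reg, data = self.fifo.popleft()
        if valid:
            # Data leaving the DWB has outlived the check window: commit it
            self.regfile[reg] = data
        if write is None:
            self.fifo.append([False, 0, 0])   # no update: shift in an invalid entry
        else:
            self.fifo.append([True, write[0], write[1]])

    def read(self, reg: int) -> int:
        # CAM search plus priority circuit: the most recent valid match wins
        for valid, r, data in reversed(self.fifo):
            if valid and r == reg:
                return data
        return self.regfile.get(reg, 0)

    def rollback(self, cycles: int):
        if not 0 <= cycles <= self.depth:
            raise ValueError("rollback exceeds DWB depth")
        # Clear the valid bits of entries written during the last `cycles` cycles
        for entry in list(self.fifo)[self.depth - cycles:]:
            entry[0] = False
```

Because rolled-back entries simply lose their valid bit, they shift out later without ever reaching the register file.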

The UCLA Mirror Processor [Tami90b] is a fault-tolerant RISC microprocessor capable of micro rollback. In addition to the on-chip parity checkers and DWBs for error detection and recovery, two processors can operate in lock-step, one master and one slave, comparing both external signals and internal signatures. Routing tens or even hundreds of internal signals to the pins would be very expensive. Instead, interleaved parity bits of the desired signals, called signatures, are generated with chains of switching XOR cells [Trem89], and this condensed data is used for the comparison. When a mismatch is found, both processors are rolled back the same number of cycles. However, certain transient errors, such as a fault in the register file, cannot be corrected with DWBs alone. In that case, the faulty data in one processor is replaced with the correct data from the other processor. If both processors have errors in the same location, a higher-level recovery scheme is necessary.
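The signature idea, compressing many internal signals into a few interleaved parity bits, can be shown with a short model. It assumes a signature width of `k` bits in which bit j is the XOR of every k-th signal starting at signal j; the switching XOR-cell chains of [Trem89] are not modeled, and the function names are hypothetical.

```python
from typing import List

def signature(signals: List[int], k: int) -> List[int]:
    """Hypothetical model: interleaved parity, bit j = XOR of signals j, j+k, j+2k, ..."""
    sig = [0] * k
    for i, s in enumerate(signals):
        sig[i % k] ^= s & 1
    return sig

def lockstep_mismatch(master: List[int], slave: List[int], k: int) -> bool:
    # Master and slave compare only the k-bit signatures, not every internal signal
    return signature(master, k) != signature(slave, k)
```

Interleaving places adjacent signals in different parity groups, so an error confined to k or fewer adjacent signals always changes at least one signature bit; two errors in the same group, however, cancel and go unnoticed.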

AMPIRE supports instruction retry rather than micro rollback, but DWBs are still used to postpone write operations to the register file and the data memory, as described later in this thesis. All components of the DWB are present (the FIFO, the CAM, and the priority circuit), but each is replaced by its asynchronous counterpart.

Chapter Three: Processor Architecture

3.1. Instruction Set

A subset of the DLX instruction set presented in [Henn90] has been chosen for AMPIRE. The DLX RISC architecture is widely studied in computer architecture classes, and it is easy to implement. Verilog models of DLX have been built at CMU [Siew92], and a VLSI implementation has been designed at Montana State University with the Berkeley OCT tools and fabricated through MOSIS [Wint92].

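R-type:  opcode (6) [31:26] | rs (5) [25:21] | rt (5) [20:16] | rd (5) [15:11] | function (11) [10:0]
I-type:  opcode (6) [31:26] | rs (5) [25:21] | rd (5) [20:16] | immediate (16) [15:0]
J-type:  opcode (6) [31:26] | offset (26) [25:0]
(field widths in bits in parentheses; bit positions in brackets)

Figure 3.1: AMPIRE Instruction Format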

The AMPIRE instruction format is shown in Figure 3.1, with minor notational changes from DLX [Henn90, p. 166]. All register-to-register ALU instructions share a single opcode, and the specific ALU operation is encoded in the function field. The AMPIRE instruction set appears in Table 3.1, and its opcode and function code assignments can be found in the parameter listing in Appendix A. The opcodes for the DLX instructions were obtained from [Host91]. The new instructions for fault simulation are placed in opcode slots unused by DLX.
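The field layout of Figure 3.1 can be expressed as a small decoder. This sketch assumes the bit positions shown in the figure (opcode in bits 31 to 26); the actual opcode and function values are listed in Appendix A and are not reproduced here, so the caller is assumed to know which format applies. The function names are hypothetical.

```python
def bits(word: int, hi: int, lo: int) -> int:
    """Extract bits hi..lo (inclusive) of a 32-bit word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def decode_r(word: int) -> dict:
    # R-type: register-to-register ALU operations
    return {"opcode": bits(word, 31, 26), "rs": bits(word, 25, 21),
            "rt": bits(word, 20, 16), "rd": bits(word, 15, 11),
            "function": bits(word, 10, 0)}

def decode_i(word: int) -> dict:
    # I-type: immediates, loads/stores, conditional branches
    return {"opcode": bits(word, 31, 26), "rs": bits(word, 25, 21),
            "rd": bits(word, 20, 16), "immediate": bits(word, 15, 0)}

def decode_j(word: int) -> dict:
    # J-type: jumps with a 26-bit offset
    return {"opcode": bits(word, 31, 26), "offset": bits(word, 25, 0)}
```

The immediate and offset fields are returned raw here; sign extension is a separate step.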

Data Transfer:
  LHI rd, imm            Load high (upper half of register) with immediate
  LW rd, imm(rs)         Load word from Mem[rs+imm]
  SW imm(rs), rd         Store word to Mem[rs+imm]
Arithmetic/Logical (register):
  ADDU rd, rs, rt        Add unsigned
  SUBU rd, rs, rt        Subtract unsigned (rs - rt)
  AND rd, rs, rt         Bitwise AND
  OR rd, rs, rt          Bitwise OR
  XOR rd, rs, rt         Bitwise XOR
  SLL rd, rs, rt         Shift left logical by (rt mod 32) bits
  SRL rd, rs, rt         Shift right logical by (rt mod 32) bits
  SRA rd, rs, rt         Shift right arithmetic by (rt mod 32) bits
  SEQ rd, rs, rt         Set if (rs == rt)
  SNE rd, rs, rt         Set if (rs != rt)
  SLT rd, rs, rt         Set if (rs < rt)
  SGT rd, rs, rt         Set if (rs > rt)
  SLE rd, rs, rt         Set if (rs <= rt)
  SGE rd, rs, rt         Set if (rs >= rt)
Arithmetic/Logical (immediate):
  ADDUI rd, rs, imm      Add unsigned immediate
  SUBUI rd, rs, imm      Subtract unsigned immediate (rs - imm)
  ANDI rd, rs, imm       Bitwise AND immediate
  ORI rd, rs, imm        Bitwise OR immediate
  XORI rd, rs, imm       Bitwise XOR immediate
  SLLI rd, rs, imm       Shift left logical by (imm mod 32) bits
  SRLI rd, rs, imm       Shift right logical by (imm mod 32) bits
  SRAI rd, rs, imm       Shift right arithmetic by (imm mod 32) bits
  SEQI rd, rs, imm       Set if (rs == imm)
  SNEI rd, rs, imm       Set if (rs != imm)
  SLTI rd, rs, imm       Set if (rs < imm)
  SGTI rd, rs, imm       Set if (rs > imm)
  SLEI rd, rs, imm       Set if (rs <= imm)
  SGEI rd, rs, imm       Set if (rs >= imm)
Flow Control:
  BEQZ rs, imm           Branch to (PC+4+imm) if (rs == 0)
  BNEZ rs, imm           Branch to (PC+4+imm) if (rs != 0)
  J offset               Jump to (PC+4+offset)
  JR rs                  Jump to address in rs
  JAL offset             Jump to (PC+4+offset); store (PC+4) in R31
  JALR rs                Jump to address in rs; store (PC+4) in R31
Miscellaneous:
  NOP                    No operation
  ADDUF rd, rs, rt       Add unsigned with fault (bad parity)
  ADDUIF rd, rs, imm     Add unsigned immediate with fault
  JRF rs                 Jump to address in rs with fault
  SWF imm(rs), rd        Store word to Mem[rs+imm] with fault
  TRAP offset            Special simulation function

Table 3.1: AMPIRE Instruction Set


The immediate and offset values are sign-extended to 32 bits before the instructions are executed. The TRAP instruction is normally reserved for exception handling, but the AMPIRE simulator uses it to print register values and to terminate the simulation; details are given at the end of section 4.2.6.1.
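Sign extension of the 16-bit immediate and 26-bit offset fields is a standard operation; a generic sketch, representing 32-bit words as Python integers masked to 32 bits (the function names are illustrative):

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value: int, width: int) -> int:
    """Sign-extend a `width`-bit field to a 32-bit two's-complement word."""
    value &= (1 << width) - 1
    if value & (1 << (width - 1)):        # sign bit set: fill the upper bits with 1s
        value |= MASK32 & ~((1 << width) - 1)
    return value & MASK32

def to_signed(word: int) -> int:
    """Interpret a 32-bit word as a signed integer (useful when printing)."""
    word &= MASK32
    return word - (1 << 32) if word & 0x80000000 else word
```

For example, the 16-bit immediate 0xFFFF extends to 0xFFFFFFFF, which is -1 as a signed value, while 0x7FFF is unchanged.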

3.2. Processor Overview

The block diagram of the processor is shown in Figure 3.2, with brief descriptions of the modules in Table 3.2. Delayed write buffers (DWBs) are used to postpone commitment to the register file and the data memory until all required checks are finished. Since each module's completion time depends on the function it performs, queues are placed between modules to improve concurrency. The queues are transparent to the functional elements, so the number of buffers can be changed without modifying other parts of the processor.

3.3. Normal Operation

The asynchronous operations without fault tolerance features will be discussed

first. The order of presentation will follow the flow of instructions through the various

modules.

3.3.1. Instruction Fetch

When the processor is reset, the PC is set to 0 and an instruction fetch cycle is started. As soon as a memory cycle is completed (acknowledged), the PC is automatically incremented to start another cycle, and this continues until a branch occurs. Even though AMPIRE supports only 32-bit reads and writes, addresses are expressed in bytes for compatibility with DLX. Therefore, the PC is incremented by four each time.

Module    Description
ALU       Arithmetic logic unit. Except for branch/jump computations, all other arithmetic and logic operations are handled by this module.
ARBK      Arbiter for accessing the K_bus (not shown). It controls outputs from the four checkers.
ARBR      Arbiter for accessing the R_bus.
BIGC      Big C-element (not shown) for rollback synchronization.
CKF       Checker for REGDWB outputs to A_bus and B_bus.
CKI       Checker for instructions executed by the IIU.
CKM       Checker for data written to MEMDWB.
CKR       Checker for data written to REGDWB.
DMEM      Data memory, single-port, organized as 32-bit words.
DQ        Data queue for memory-to-register transfers.
IIU       Instruction issuing unit. It is the main controller that decodes instructions, reads data from the register file, and dispatches operation requests to other modules.
IMEM      Instruction memory, read-only, organized as 32-bit words.
IQ        Instruction queue.
LOG       Instruction log, where uncommitted instructions are kept. All checkers send their results to the log, and the log issues validation signals or initiates rollback.
MEMDWB    Delayed write buffer for data memory operations.
PC        Program counter. It starts at 0 when the processor is reset, and it is automatically incremented after each instruction fetch.
REGDWB    Delayed write buffer for register file operations.
REGFILE   Register file, with two read ports and one write port. There are thirty-two 32-bit general-purpose registers, with R0 being a constant of zero.
RESTABLE  Register reservation table. All registers being read or written must be cleared by the reservation table first.

Table 3.2: Processor Modules

After an instruction is retrieved from memory, IMEM sends a write request to the instruction queue. The time required to read an instruction from memory does not vary much, even for a real self-timed memory, unless the memory actually contains multiple elements with different access times. Instruction cycles, on the other hand, are variable, ranging from a very short NOP to a long wait for reservation clearance (discussed in the next section). With the IQ, multiple instructions can be prefetched during long instruction cycles, and new instructions are available quickly after short cycles.

Figure 3.2: AMPIRE Block Diagram (modules PC, IMEM, IQ, IIU, RESTABLE, REGFILE, REGDWB, ALU, ARBR, MEMDWB, DMEM, DQ, and LOG, with checkers CKI, CKF, CKR, and CKM, connected by I_bus, A_bus, B_bus, D_bus, R_bus, and K_bus; the IIU tracks SEQ, RS, RT, and RD; the LOG drives the validate, halt, and rollback signals; number of buffers: IQ 2, ALU 2, DQ 2, each checker 2, REGDWB 4, MEMDWB 4, LOG 8)
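Why a queue helps between a fetch stage with nearly constant access time and an issue stage with variable instruction cycles can be seen in a simple timing model. This is a sketch under assumed conditions: the producer (standing in for IMEM) and the consumer (standing in for the IIU) each handle one item at a time, an item leaves the queue when the consumer starts on it, and the durations are made-up numbers, not AMPIRE measurements. The function name is hypothetical.

```python
from typing import List

def finish_time(produce: List[float], consume: List[float], depth: int) -> float:
    """Hypothetical model: completion time of the last item through producer -> queue -> consumer.

    produce[i]: time the producer spends on item i (e.g. an instruction read)
    consume[i]: time the consumer spends on item i (e.g. an instruction cycle)
    An item can enter the queue only while fewer than `depth` items are waiting.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    n = len(produce)
    start = [0.0] * n          # time the consumer removes item i and starts on it
    ready = 0.0                # time the producer can begin its next item
    busy_until = 0.0           # time the consumer finishes its current item
    for i in range(n):
        produced = ready + produce[i]
        # While the queue is full, wait until item i - depth has been removed
        deposit = max(produced, start[i - depth]) if i >= depth else produced
        ready = deposit
        start[i] = max(deposit, busy_until)
        busy_until = start[i] + consume[i]
    return busy_until
```

With one long instruction cycle followed by one slow fetch, `finish_time([1, 1, 1, 5, 1], [5, 1, 1, 1, 1], depth=1)` returns 13, while `depth=4` returns 10: the deeper queue lets fetches run ahead during the long cycle and hides the later slow fetch.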

When a branch occurs, the pre-fetched instructions have to be invalidated. The

IIU disables the output buffer of the IQ and sends a new address to the PC via I_bus.

The PC_load signal also causes IMEM and IQ to drop all current transactions, and the

IQ is cleared. The IQ size has been chosen to be two so that the latency through IQ is

not excessive after a branch is taken.

Since the speed of IMEM and the size of the IQ are both unknown to the rest of the processor, the value in the PC module cannot be used to determine the address of the instruction being executed. One solution, used in AMPIRE, is to add a separate program counter inside the IIU for instruction logging and branch computations. Whereas the PC module is incremented after every IMEM access, this internal counter is incremented whenever an instruction is read from the IQ, just as the program counter of a synchronous processor is incremented in every instruction fetch stage. In this respect, the PC module is very similar to the remote program counter discussed in [Patt83]. When a branch occurs, the value of the internal program counter is used to calculate the destination address, and both program counters are loaded with the same new address.
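The relationship between the PC module and the IIU's internal counter can be modeled behaviorally. In this sketch the class and method names are hypothetical, IMEM is a dictionary, the IQ depth is a parameter, and the branch target is supplied directly rather than computed from an instruction.

```python
from collections import deque

class FetchPath:
    """Hypothetical model of the PC module, the IQ, and the IIU's internal counter."""

    def __init__(self, iq_depth: int = 2):
        self.fetch_pc = 0          # PC module: address of the next IMEM read
        self.issue_pc = 0          # IIU counter: address of the next instruction issued
        self.iq = deque()
        self.iq_depth = iq_depth

    def fetch(self, imem: dict):
        # The PC module runs ahead, adding 4 after every IMEM access, while the IQ has room
        if len(self.iq) < self.iq_depth:
            self.iq.append(imem.get(self.fetch_pc, "NOP"))
            self.fetch_pc += 4

    def issue(self):
        # The IIU increments its own counter only when it takes an instruction from the IQ
        inst = self.iq.popleft()
        addr = self.issue_pc
        self.issue_pc += 4
        return addr, inst

    def branch(self, target: int):
        # A taken branch flushes the IQ and loads both counters with the same address
        self.iq.clear()
        self.fetch_pc = self.issue_pc = target
```

After two fetches the PC module already points at address 8 while the IIU counter still reads 0, which is why only the internal counter can be used for logging and branch computation.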

An alternative to having a program counter in the IIU is to store each instruction together with its address in the IQ, so that the IIU reads both at the same time. However, since the IQ can be arbitrarily long, the cost of the additional storage elements in the buffers and of the wires for routing 32-bit addresses can be quite high.

3.3.2. Instruction Issue

An instruction issue cycle is started when the IIU accepts an instruction presented by the IQ. Part of the decoding process determines which registers need to be read or written, and these registers are then sent to the reservation table for clearance. Because the processor is asynchronous, the time required to execute an instruction cannot be predetermined, and even the order in which instructions complete is unknown. A compiler therefore cannot reliably schedule instructions to avoid data hazards, and hazard avoidance has to be handled by the processor.

Before an instruction IY that writes to R1 can be issued, there must be no other instruction IX in the pipeline that also modifies R1. Otherwise, IY may complete before IX and cause a write-after-write error. Reads are handled similarly: any previous instruction that writes to R1 must complete before R1 can be read again. In fact, all instructions being executed at the same time must be independent of each other. Note that completion does not imply commitment to the register file.

When a reservation request is received at the RESTABLE, the reservation bits for

all source and destination registers are checked. If all of them are clear, then the

destination register is reserved, and the reservation request is acknowledged.

Otherwise, the acknowledgement is delayed until the appropriate bits are cleared by the

REGDWB. Since R0 is a zero constant, it is always available.
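The clearance rule can be sketched as a vector of reservation bits. The sketch assumes one destination register per instruction and returns False where the hardware would instead delay its acknowledgement; the class and method names are illustrative.

```python
from typing import Iterable, Optional

class ReservationTable:
    """Hypothetical model of RESTABLE: one reservation bit per register."""

    def __init__(self, nregs: int = 32):
        self.reserved = [False] * nregs

    def request(self, sources: Iterable[int], dest: Optional[int]) -> bool:
        """Grant (True) only if no source or destination register is reserved."""
        regs = list(sources) + ([dest] if dest is not None else [])
        # R0 is the constant zero, so it is never reserved and always available
        if any(r != 0 and self.reserved[r] for r in regs):
            return False           # hardware would hold off the acknowledgement
        if dest is not None and dest != 0:
            self.reserved[dest] = True
        return True

    def clear(self, reg: int):
        # Issued when the result for `reg` has been written into the REGDWB
        self.reserved[reg] = False
```

Checking the destination as well as the sources covers the write-after-write case, and checking the sources covers read-after-write.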

Even though PC and data memory are also state elements, reservations are not

needed. Each PC operation is executed directly and immediately by the IIU, and since

AMPIRE is a load-store machine, memory transfers occur between the data memory

and the register file only. Memory write operations are carried out sequentially through

the MEMDWB, and memory read operations have register destinations which are

checked by the normal reservation process.


After reservation, the source registers need to be read from the register file. Even

though the IIU actually communicates with the REGDWB, the existence of REGDWB

is transparent to the IIU. The operations within REGDWB will be discussed later.

Instructions can be divided into three categories: ALU, data memory, and flow control. ALU and data memory instructions are sent to the ALU and the MEMDWB, respectively, and branches and jumps are executed by the IIU itself. Sign extension of all immediate and offset values is also performed by the IIU because the extended values are required for branch and data memory address computations.

3.3.3. Instruction Execution

3.3.3.1. ALU

ALU operations are straightforward. Two operands (only one for LHI and PASS) and a function code are used to produce a result that is written to a destination register. Neither DLX nor AMPIRE uses condition codes.
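A behavioral model of the register-to-register operations in Table 3.1 follows. It assumes 32-bit two's-complement words and treats the set-comparison instructions as signed comparisons (the table does not say, so this is an assumption); mnemonic strings stand in for the function codes listed in Appendix A.

```python
MASK32 = 0xFFFFFFFF

def signed(x: int) -> int:
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def alu(op: str, a: int, b: int) -> int:
    """Hypothetical ALU model: returns a 32-bit result; no condition codes are produced."""
    a &= MASK32
    b &= MASK32
    sh = b % 32                                # shift amount is (b mod 32)
    ops = {
        "ADDU": lambda: a + b,
        "SUBU": lambda: a - b,
        "AND": lambda: a & b,
        "OR": lambda: a | b,
        "XOR": lambda: a ^ b,
        "SLL": lambda: a << sh,
        "SRL": lambda: a >> sh,
        "SRA": lambda: signed(a) >> sh,        # >> on a negative int is arithmetic in Python
        "SEQ": lambda: int(a == b),
        "SNE": lambda: int(a != b),
        "SLT": lambda: int(signed(a) < signed(b)),
        "SGT": lambda: int(signed(a) > signed(b)),
        "SLE": lambda: int(signed(a) <= signed(b)),
        "SGE": lambda: int(signed(a) >= signed(b)),
    }
    return ops[op]() & MASK32                  # results wrap to 32 bits
```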

Unlike the ALU of a synchronous processor, this ALU has no predetermined time limit for each operation. For example, a bitwise OR can be done very quickly, but a simple ripple-carry adder needs much longer for carry propagation. Although a synchronous processor can allocate multiple cycles to time-consuming operations, the other components in its pipeline must then be designed to accommodate them.
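The data-dependent delay of a ripple-carry adder can be made concrete: a self-timed adder can signal completion once its longest carry chain has settled. The sketch below computes the sum together with that chain length, assuming one unit of delay per bit position a carry ripples through; it models timing only and is not the AMPIRE ALU. The function name is hypothetical.

```python
def add_with_carry_chain(a: int, b: int, width: int = 32):
    """Hypothetical model: return (sum mod 2**width, longest carry chain in bit positions)."""
    carry = 0
    run = 0            # how many positions the current carry has travelled
    longest = 0
    result = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        if x & y:                      # generate: a new carry starts here
            run = 1
        elif (x ^ y) and carry:        # propagate: the incoming carry ripples on
            run += 1
        else:                          # kill, or no carry to propagate
            run = 0
        carry = (x & y) | (carry & (x ^ y))
        longest = max(longest, run)
    return result, longest
```

Adding 0x0F0F and 0x00F0 produces no carries at all (chain length 0), while 0xFFFFFFFF + 1 ripples a carry through all 32 positions: the same instruction can take very different times depending on its operands.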

3.3.3.2. Data Memory Load/Store

All data memory read and write operations are controlled by the MEMDWB. The address and the data (for SW instructions) are sent from the IIU via A_bus and B_bus, respectively. With current technology, memory devices are generally slower than processors. After a data memory request is accepted by the MEMDWB, the IIU can start working on the next instruction instead of waiting for DMEM. The MEMDWB is therefore useful even if fault tolerance is not needed.

Memory writes are queued in the buffers (first-in, first-out) until DMEM is ready

for the next transaction. Each MEMDWB entry contains the data to be written and its

destination address. For read operations, the address to be read is compared with the

ones in the buffers. If a match is found, data is retrieved from the appropriate buffer

and sent to the DQ, without accessing DMEM. If there is more than one match, a

priority decoder selects the most recent value. If there is no match, then data is read

from DMEM in the next memory cycle, even if other write operations are waiting in the

queue. Memory reads are given priority over memory writes so that the reserved

destination registers are cleared as quickly as possible to improve concurrency. Each

DQ buffer entry includes the memory data and its target register number. The register

number is passed directly from the MEMDWB controller to the DQ, without going

through the MEMDWB buffer elements and DMEM.
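The MEMDWB read and write policy can be summarized in a behavioral sketch. DMEM is modeled as a dictionary, each call to `dmem_cycle` stands for one memory transaction, and the class and method names are hypothetical; handshaking and validation are not modeled.

```python
from collections import deque
from typing import Dict, Optional, Tuple

class MemDWB:
    """Hypothetical model of MEMDWB forwarding and read-over-write priority."""

    def __init__(self, depth: int, dmem: Dict[int, int]):
        self.depth = depth
        self.dmem = dmem
        self.writes = deque()          # (address, data), oldest on the left
        self.pending_reads = deque()   # (address, destination register)

    def store(self, addr: int, data: int) -> bool:
        if len(self.writes) >= self.depth:
            return False               # full: the IIU's request is not yet acknowledged
        self.writes.append((addr, data))
        return True

    def load(self, addr: int, dest_reg: int) -> Optional[Tuple[int, int]]:
        # Associative search; the newest matching write has priority
        for a, d in reversed(self.writes):
            if a == addr:
                return dest_reg, d     # sent to the DQ without accessing DMEM
        self.pending_reads.append((addr, dest_reg))
        return None

    def dmem_cycle(self) -> Optional[Tuple[int, int]]:
        """One DMEM transaction; reads take priority over queued writes."""
        if self.pending_reads:
            addr, reg = self.pending_reads.popleft()
            return reg, self.dmem.get(addr, 0)   # (register number, data) for the DQ
        if self.writes:
            addr, data = self.writes.popleft()
            self.dmem[addr] = data
        return None
```

Because reads that hit in the buffer never reach DMEM, letting missed reads overtake queued writes is safe: a read can only bypass writes to other addresses.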

A drawback of this approach is the need for associative comparators; with 32-bit addresses, the overhead can be quite high. Another penalty is that the propagation of data through the queue buffers must be stopped before comparisons can be made, which may increase the latency through the queue. This side effect is discussed in detail in section 3.5.

If associative comparators are not desirable, an alternative is to put both memory read and write requests in the same queue and process them in the order in which they are issued. Because there is no comparator to check whether the requested data is already in the MEMDWB, memory reads cannot be given higher priority than memory writes. Since multiple read requests can accumulate in the queue, each MEMDWB buffer must also store two additional pieces of data: the destination register number and a read/write mode bit. The IIU can still proceed to the next instruction, but memory reads have to wait until all previous memory requests have been processed by DMEM. As will be discussed in the fault tolerance section, memory writes are held in the MEMDWB until their data have been validated, which lengthens the delay for both reads and writes. This method results in simpler hardware because no address comparators are used, but performance is lower because of the longer reservation waiting periods for memory reads and the higher memory traffic.

3.3.3.3. Flow Control

Flow control instructions include conditional branches and unconditional jumps. These are handled directly by the IIU so that the ALU can be dedicated to "real" computations, similar to the way branches are handled in the Berkeley DSP [Jaco90]. Because branches are based on register comparisons instead of condition codes, they are independent of previous ALU operations except when the registers involved are reserved. The IIU does, however, require an internal adder for address calculations.
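The flow-control semantics of Table 3.1 can be written as a function returning the next PC and an optional link value for R31. The sketch assumes that `pc` is the address of the branch itself, that `imm` (immediate or offset) is already sign-extended, and that there is no delay slot, as in AMPIRE; the function name is hypothetical.

```python
from typing import Optional, Tuple

MASK32 = 0xFFFFFFFF

def next_pc(op: str, pc: int, rs_val: int = 0, imm: int = 0) -> Tuple[int, Optional[int]]:
    """Hypothetical model: return (next PC, value to write to R31 or None)."""
    seq = (pc + 4) & MASK32                  # fall-through address PC+4
    target = (pc + 4 + imm) & MASK32         # PC-relative target
    if op == "BEQZ":
        return (target if rs_val == 0 else seq), None
    if op == "BNEZ":
        return (target if rs_val != 0 else seq), None
    if op == "J":
        return target, None
    if op == "JAL":
        return target, seq                   # link: store PC+4 in R31
    if op == "JR":
        return rs_val & MASK32, None
    if op == "JALR":
        return rs_val & MASK32, seq
    raise ValueError(f"not a flow-control instruction: {op}")
```

The additions here correspond to the IIU's internal adder; the register test needs only a zero comparison, so no ALU involvement is required.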

It is not clear whether the DLX architecture has delayed branches, but for simplicity AMPIRE does not support them. Delayed branches are generally used in synchronous processors to reduce branch penalties by minimizing pipeline stalls. Although an asynchronous processor may also benefit from delayed branches, they make rollback more difficult, because the instruction in the delay slot, examined alone, reveals nothing about the branch instruction associated with it. Furthermore, statistics show that fewer than 50% of delay slots are usefully filled [Henn90, p. 276].

However, since delayed branches usually improve performance [Henn90, p. 277], a method to support them in AMPIRE is briefly described here. The reader should read section 3.4 on fault tolerance before this paragraph and the next. One way to handle both delayed branches and rollback is to maintain two sets of sequence numbers and check vectors for each branch in the instruction log. For example, suppose a branch INSTb has sequence number SEQb and check vector Vb, and the delay-slot instruction INSTd has sequence number SEQd and check vector Vd. The instruction log then stores both sequence numbers, SEQb and SEQd, and both check vectors, Vb and Vd, with the branch instruction INSTb, while the entry for INSTd is unchanged. When a checker clears a bit in vector Vd, two bits are cleared in the LOG, one for each of INSTb and INSTd. This way, the branch instruction cannot be validated until all the check bits for its delay slot are also cleared. The FIFO nature of the LOG already prevents the delay slot from being validated before its branch.
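The proposed bookkeeping for a branch log entry can be sketched with bit masks. Everything here is hypothetical: the checker bit positions, the class name, and the method names are illustrative, and the scheme is only the proposal outlined above, not something AMPIRE implements. The separate, unchanged log entry for INSTd would be updated in the usual way alongside this one.

```python
class BranchLogEntry:
    """Hypothetical log entry for branch INSTb that also tracks its delay slot INSTd."""

    def __init__(self, seq_b: int, vec_b: int, seq_d: int, vec_d: int):
        self.seq_b, self.vec_b = seq_b, vec_b
        self.seq_d, self.vec_d = seq_d, vec_d   # copy of INSTd's check vector

    def checker_pass(self, seq: int, checker_bit: int):
        # A pass for either instruction clears the matching bit in its vector
        if seq == self.seq_b:
            self.vec_b &= ~checker_bit
        elif seq == self.seq_d:
            self.vec_d &= ~checker_bit

    def can_validate(self) -> bool:
        # INSTb is validated only when its own checks and all of INSTd's checks have passed
        return self.vec_b == 0 and self.vec_d == 0
```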

When either INSTb or INSTd is to be invalidated, the processor has to be rolled back to the branch INSTb. The instruction log therefore has to include a priority circuit that sends out the address of INSTb, not that of INSTd. The LOG also needs more buffer elements and comparators to maintain the additional sequence number and check vector for each log entry. In addition, the control circuitry of the IIU and/or the LOG becomes more complicated, because the check vector for INSTd is not available until that instruction is fetched and decoded, yet it must be stored with INSTb in the LOG.

3.3.4. Register File Access

After an ALU computation or memory retrieval is completed, the R_bus is used to transfer data to the register file. Since both the ALU and the DQ may have data ready at any time, the ARBR arbiter coordinates access to the R_bus. If only one request is detected, that request is granted, and the module can start its transaction with the REGDWB. If both requests arrive at the same time, the module that most recently had bus access has to wait until the other module finishes one transaction.

The DQ helps to reduce the DMEM read cycle time by separating the R_bus arbitration stage from the data memory. Since the R_bus may be busy when the data read from DMEM is ready, a DQ buffer allows the read transaction to complete so that DMEM can start the next memory operation. This is important because memory access is likely to be the slowest activity in a processor. The overall memory read time (from the IIU to the REGDWB) also depends on the depth of the DQ: while the DQ reduces the DMEM read cycle time, it also adds propagation delay through the queue. The penalty is highest for an isolated DMEM read, but for several consecutive DMEM reads, average performance may improve because of concurrency between DMEM and the DQ. The depth of the DQ should be selected based on the expected memory activity, but in general it should be kept small so that the worst-case penalty is not excessive.

After a piece of data is written to the REGDWB, the corresponding register reservation is cleared by sending the register number to the RESTABLE. This can be done before the actual register is updated because the REGDWB controls both read and write operations. When the IIU initiates a read request, the REGFILE is read while the REGDWB is searched for a match. This differs from the MEMDWB, where DMEM is read only if there is a miss in the MEMDWB. The register file has two read ports and one write port, and it is designed to be read and written every instruction cycle. This is not practical for a large, slow memory like DMEM, so the MEMDWB is optimized to reduce memory traffic.

Since the same register in the REGFILE can be read and written at the same time, and these two operations occur asynchronously, care must be taken to prevent incorrect (partially written) data from being presented to the requester. Because data in the REGDWB is not erased until its write cycle is completed, a read of a register that is still being written to the REGFILE also results in a match in the REGDWB. By passing all matches in the REGDWB and the REGFILE through a priority decoder, with the REGFILE having the lowest priority, the correct and most recent data is given to the IIU; the data read from the REGFILE is effectively ignored in this case.

3.4. Fault Tolerance

In this section, the instruction rollback process and the hardware features needed

to support it will be described.

3.4.1. Sequence Number and Check Vector

Because each module in AMPIRE runs at its own rate, the sequence of events, such as the order of completion, is not predictable. Therefore, a sequence number has to be assigned to each instruction when it is issued by the IIU. All storage elements for uncommitted data, including intermediate buffers, the RESTABLE, and the DWBs, must store the sequence numbers along with their data. For example, suppose the instruction R3 = R1 + R2 has sequence number SEQx. When the values of R1 and R2 are sent to the ALU, SEQx is sent there too. When the result for R3 is delivered to the REGDWB, the same sequence number SEQx is stored in its buffer, since that data belongs to the same instruction. When a rollback is to be done, the processor can determine which instructions have to be invalidated by checking their sequence numbers. The number of bits allocated to the sequence number is determined by the maximum desired number of uncommitted instructions:

$$\text{uncommitted}_{\max} = 2^{(\text{number of sequence bits}) - 1}$$

The extra bit allows a quick comparison to determine whether an instruction X comes before or after an instruction Y. For example, if 3 bits are used for the sequence number, there can be up to 4 outstanding instructions, half of the eight 3-bit combinations. Figure 3.3 contains graphical illustrations of the sequence number "windows".

Figure 3.3: Sequence Number Windows (two circular diagrams of the 3-bit sequence numbers 0 through 7, one without rollover and one with rollover, each marking the current sequence number and the forbidden region)

If $n$ is the number of bits allocated for the sequence number, there are $N = 2^n$ possible sequence numbers, and the maximum number of uncommitted instructions is $N/2$. At any point in time, the sequence numbers of the most recent $N/2$ instructions are:

$$k,\ (k+1) \bmod N,\ (k+2) \bmod N,\ \ldots,\ (k + N/2 - 1) \bmod N, \quad \text{where } 0 \le k < N.$$

An instruction INSTx is the same as, or newer than, INSTy if and only if

$$SEQ_x = (SEQ_y + j) \bmod N, \quad \text{where } 0 \le j \le N/2 - 1.$$

Hence $j = (SEQ_x - SEQ_y) \bmod N$, which can be obtained using an $n$-bit subtractor. The condition $0 \le j \le N/2 - 1$ holds if and only if the most significant bit (MSb) of $j$ is 0. Therefore, if the MSb of the subtraction result is 0, INSTx is the same as or newer than INSTy; otherwise, INSTx precedes INSTy. This result is used in the rollback process to determine whether an instruction should be invalidated.
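The comparison is a direct transcription of the rule above for an $n$-bit sequence number; the function name is illustrative:

```python
def same_or_newer(seq_x: int, seq_y: int, n: int) -> bool:
    """True if INSTx is the same as or newer than INSTy (n-bit sequence numbers).

    Valid only while at most 2**(n-1) instructions are uncommitted.
    """
    j = (seq_x - seq_y) % (1 << n)       # n-bit subtraction, wrapping modulo N = 2**n
    return ((j >> (n - 1)) & 1) == 0     # MSb of j is 0  <=>  0 <= j <= N/2 - 1
```

With 3-bit numbers, `same_or_newer(1, 6, 3)` is True because 1 follows 6 after rollover ($j = 3$), while `same_or_newer(6, 1, 3)` is False ($j = 5$, MSb set).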

Another piece of information that has to be determined when an instruction is issued is the set of verification steps it must go through before it can be declared valid. For example, all instructions have to pass the instruction checker CKI, but only memory operations need to be checked by CKM. A check vector is the collection of check bits, with each bit identifying a particular checker. An example check vector for an ALU operation is shown in Figure 3.4. When all of the required checks have passed, the instruction can be validated. The check vectors are stored and updated in the instruction log.

Figure 3.4: Check Vector for ALU Operation (a 4-bit vector, 1 0 1 1, with one bit for each of the checkers CKM, CKF, CKI, and CKR)
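Check vectors map naturally to bit masks. The bit assignment and the per-class requirements below are assumptions for illustration, inferred from the descriptions in this chapter (every instruction passes CKI, register reads pass CKF, register writes pass CKR, memory operations pass CKM); they are not taken from the AMPIRE parameter listing.

```python
# Hypothetical bit assignment for the four checkers
CKI, CKF, CKR, CKM = 1, 2, 4, 8

# Hypothetical required checks per instruction class
REQUIRED = {
    "alu":   CKI | CKF | CKR,
    "load":  CKI | CKF | CKM | CKR,
    "store": CKI | CKF | CKM,
    "jump":  CKI,
}

def report_pass(vector: int, checker: int) -> int:
    return vector & ~checker          # clear the bit of the checker that passed

def ready_to_validate(vector: int) -> bool:
    return vector == 0                # all required checks have passed
```

Checker reports may arrive in any order; the vector reaches zero only after the last required one.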

3.4.2. Error Detection

Each data word and address in AMPIRE is protected by an even parity bit. This simple mechanism is used only to show how rollback is done when an error is detected; if more protection is desired, stronger error detection or correction codes can be used instead. For storage elements such as memories and queue buffers, the parity bit is simply passed along with the data to the destination, and it is up to the receiver to check data integrity. When an instruction or the register file is read, the IIU sends a request to the CKI or CKF checker, respectively. The other two checkers, CKM and CKR, are used by the MEMDWB and REGDWB to verify the data they receive.

When a module performs a computation, a new parity bit is generated. This is done in the PC each time the address is incremented, in the IIU for address calculation and sign extension, and in the ALU before each result is sent to the register file. The checkers decide whether there is an error by XORing all the bits, including the parity bit. Since even parity is used, a result of 1 indicates an error. Again, more elaborate schemes can be integrated into the processor.
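Even-parity generation and checking reduce to XOR reductions. A minimal sketch, assuming a 32-bit word with its parity bit carried alongside; the function names are illustrative:

```python
def parity_bit(word: int) -> int:
    """Even-parity bit for a 32-bit word: makes the total number of 1s even."""
    return bin(word & 0xFFFFFFFF).count("1") & 1

def check(word: int, p: int) -> bool:
    """Checker: XOR of all data bits and the parity bit; a result of 1 means an error."""
    return (parity_bit(word) ^ p) == 0      # True if no error is detected
```

For 0b1011 the parity bit is 1. Flipping any single bit makes `check` fail, but flipping two bits (for example 0b1011 to 0b1000) goes undetected, which is the limitation that motivates stronger codes.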

All checkers share the K_bus for reporting their results to the instruction log, and the ARBK arbiter grants bus access to the checkers in round-robin fashion. The K_bus carries the sequence number, the checker identification number, and the pass/fail signal.
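Round-robin arbitration among the checkers, and the fields carried on the K_bus, can be sketched as follows. The grant order, the starting position, and the `KBusReport` type are assumptions; a real asynchronous arbiter also needs mutual-exclusion elements to resolve simultaneous requests, which this sequential model does not capture.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class KBusReport:
    seq: int          # sequence number of the checked instruction
    checker_id: int   # which checker is reporting
    passed: bool      # pass/fail signal

class RoundRobinArbiter:
    """Hypothetical model of ARBK granting one requesting checker at a time."""

    def __init__(self, n: int):
        self.n = n
        self.last = n - 1             # so that requester 0 is considered first

    def grant(self, requests: List[bool]) -> Optional[int]:
        # Search starting just after the most recent winner
        for k in range(1, self.n + 1):
            i = (self.last + k) % self.n
            if requests[i]:
                self.last = i
                return i
        return None
```

With all four checkers requesting continuously, grants rotate 0, 1, 2, 3, 0, so no checker can be starved.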

3.4.3. Instruction Log and Validation

Uncommitted instructions are stored in the instruction log, and all instruction validations and rollbacks are determined by this module. The log contains 8 entries, so 4 bits have to be used for the sequence number, as discussed in section 3.4.1. Each log entry holds the data for one instruction: its sequence number, instruction address, and check vector. The instructions themselves are not stored in the log because they are read from IMEM again in the event of rollback.

When a checker reports a "pass", the appropriate check bit for the instruction is cleared. When all of the check bits for the oldest entry in the log become zero, that instruction can be validated and deleted from the log. First-in first-out order must be maintained because once an instruction is validated, it cannot be unvalidated. Because of the asynchrony, INSTx+1 may pass all its checks before INSTx does, but if an error occurs in INSTx, INSTx+1 must be undone.
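The log's in-order validation can be sketched as a FIFO of entries whose check vectors are cleared as checker reports arrive. This is a behavioral sketch under assumptions: check vectors are integer bit masks, a failing report simply returns the address to re-execute (invalidating the younger entries is not modeled), and the class and method names are hypothetical.

```python
from collections import deque
from typing import List, Optional

class InstructionLog:
    """Hypothetical model of the LOG: FIFO entries of [seq, address, check_vector]."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.entries = deque()

    def log(self, seq: int, address: int, vector: int) -> bool:
        if len(self.entries) >= self.capacity:
            return False              # full: the IIU must wait
        self.entries.append([seq, address, vector])
        return True

    def report(self, seq: int, checker_bit: int, passed: bool) -> Optional[int]:
        """Apply a checker report; on failure, return the address to roll back to."""
        for entry in self.entries:
            if entry[0] == seq:
                if not passed:
                    return entry[1]
                entry[2] &= ~checker_bit
                break
        return None

    def validate_ready(self) -> List[int]:
        # Only the oldest entry may be validated, so stop at the first pending one
        validated = []
        while self.entries and self.entries[0][2] == 0:
            validated.append(self.entries.popleft()[0])
        return validated
```

A younger instruction that has passed all its checks still waits: after entry 1 passes but entry 0 has not, `validate_ready()` returns nothing, and once entry 0 passes both are validated together.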

An instruction is validated by sending its sequence number to the DWB(s) so that the data waiting in the buffer can be written to its permanent location. If there is a piece of data ahead of it in the queue, whether validated or not, the commitment has to be delayed so that a write-after-write hazard cannot occur. If an instruction does not modify a storage element, no validation signal needs to be sent. In AMPIRE, an instruction can modify at most one register or memory location. Therefore, the LOG only needs to send the validation signal to one DWB, without disturbing other modules. The added cost is an extra handshake wire, but the DWB not involved in the validation is not slowed down by unnecessary comparisons (the performance cost of such comparisons is addressed in section 3.5). However, if there are many DWBs in the processor, this trade-off may have to be reconsidered.

It is important to note that when the LOG issues a validation, the entry to be validated must already be in the appropriate DWB. This is required because buffering a validation request while waiting for the corresponding data would make the design more complicated than necessary. This ordering is ensured by sending the checker request after the data is placed in the DWB queue: without a "pass" signal from the checker, the instruction cannot be validated. If no checker is associated with a DWB, a check bit is still used so that the LOG cannot validate an instruction without receiving an "arrived" signal from the DWB. The CKI checker is placed on the B_bus instead of the I_bus for the same reason: the IIU first logs an instruction and then sends it to the CKI, so that when the checker notifies the LOG, the entry is already in the LOG to be processed.

3.4.4. Rollback

If a checker reports an error, that instruction and all instructions issued after it have to be invalidated. The instruction at which the error occurred is re-executed by loading the PC with its address from the LOG. The LOG sends the rollback request to all modules except the PC, IMEM, and IQ, since the IIU can clear these easily, as if a branch were taken. The high-level signal flow of validation and rollback is shown in Figure 3.5.

Figure 3.5: Validation and Rollback Signal Flow Diagram (the checkers CKI, CKF, CKR, and CKM report to the LOG, which sends validate to a specific DWB and invalidate to all modules)

The instructions and operations that have to be invalidated are determined by subtracting the error sequence number from the sequence number of each buffer entry. If the high bit of the result is 0, the entry is either the erroneous one or a more recent one, and it is erased. If the high bit is 1, the operation is not affected by the error. An invalidation request is sent to all buffer elements at the same time, and the comparisons are done in parallel. After each buffer is finished, the acknowledgement signals from the individual elements are combined by one or more C-elements, which generate a module-level acknowledgement signal that is sent back to the LOG.
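The C-elements that merge the per-element acknowledgements follow the standard Muller C-element rule: the output changes only when all inputs agree, and otherwise holds its previous value. A behavioral model, with a hypothetical `CElement` class:

```python
from typing import List

class CElement:
    """Muller C-element: output follows the inputs when they all agree, else holds."""

    def __init__(self, init: int = 0):
        self.out = init

    def update(self, inputs: List[int]) -> int:
        if all(x == 1 for x in inputs):
            self.out = 1
        elif all(x == 0 for x in inputs):
            self.out = 0
        return self.out               # mixed inputs: keep the previous value
```

The module-level acknowledgement therefore rises only after every buffer element has acknowledged, and falls only after every element has withdrawn its acknowledgement. With many elements, hardware would use a tree of smaller C-elements, which behaves the same way.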

Since the IIU and all transactions with the IIU (such as reading the register file)

are always working on the most recent instruction, they can be invalidated without any

comparison with the error sequence number. The reservation table is a special case that

deserves attention, since there is no DWB for it. When a reservation is made, the

sequence number is stored in the corresponding entry in the table. Rolling back the

RESTABLE simply means clearing the appropriate reservation bits based on the same

sequence number comparisons.

Before the rollback can actually take place, all components in the processor must

be synchronized. This is the topic of section 3.5. Further details of the rollback process

will be addressed in section 4.1.6 when the Verilog code is discussed.

3.4.5. Delayed Write Buffer

Since the DWB is a key element that makes rollback possible, its hardware structure is summarized here. The DWB for REGFILE is shown in Figure 3.6, with an expanded view of the DWB buffer element. It is an extension of the synchronous DWB shown in Figure 2.8, with two additional fields: the sequence number and the wait bit.

[Figure 3.6: Asynchronous Delayed Write Buffer for Register File. Recoverable labels: buffer element fields data, reg, v, seq, and w (v = valid bit, w = wait bit); Priority Circuit; REGFILE; buses DATA 1, DATA 2, REG 1, REG 2, and SEQLOG; comparators X and subtractor S.]

The X elements are comparators, and the S element is a subtractor, as described before. The SEQLOG bus is driven by the instruction log during validation and rollback cycles. Since these two activities are mutually exclusive, only one bus is needed. When an instruction is validated and seq = SEQLOG, the wait bit is cleared, and the data can be written to the register file once all the entries ahead of it have also been cleared. During the rollback cycle, the subtractor S determines whether that entry should be invalidated by controlling the valid bit.

The two X comparators in squares already exist in the synchronous DWB, in the

CAM section shown in Figure 2.8. For MEMDWB, the two DATA buses are reduced

to one, and the two REG buses are replaced by a single memory address bus.
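The validate, commit, and rollback behavior of a DWB element can be summarized in a behavioral Python sketch. It is not the thesis's Verilog. The class and method names (DelayedWriteBuffer, validate, rollback) and the 8-bit sequence width are hypothetical. The parallel comparators X and the subtractor S are modeled as loops over the entries. In hardware, rolled-back entries simply have their valid bits cleared; the sketch drops them from the list.

```python
from dataclasses import dataclass, field

SEQ_BITS = 8  # assumed sequence-number width


def same_or_newer(seq: int, seqlog: int) -> bool:
    """Subtractor S: the high bit of (seq - SEQLOG) is 0."""
    diff = (seq - seqlog) & ((1 << SEQ_BITS) - 1)
    return diff >> (SEQ_BITS - 1) == 0


@dataclass
class Entry:
    seq: int
    reg: int
    data: int
    valid: bool = True
    wait: bool = True  # set until the LOG validates the instruction


@dataclass
class DelayedWriteBuffer:
    entries: list = field(default_factory=list)  # oldest entry first
    regfile: dict = field(default_factory=dict)

    def insert(self, seq: int, reg: int, data: int) -> None:
        self.entries.append(Entry(seq, reg, data))

    def validate(self, seqlog: int) -> None:
        # comparator X: clear the wait bit of the entry whose seq == SEQLOG
        for e in self.entries:
            if e.valid and e.seq == seqlog:
                e.wait = False
        self._commit()

    def rollback(self, seqlog: int) -> None:
        # subtractor S: invalidate the erroneous entry and all newer ones
        for e in self.entries:
            if same_or_newer(e.seq, seqlog):
                e.valid = False
        self.entries = [e for e in self.entries if e.valid]

    def _commit(self) -> None:
        # only the oldest entry may be written, and only after its wait bit clears
        while self.entries and not self.entries[0].wait:
            head = self.entries.pop(0)
            self.regfile[head.reg] = head.data
```

Validating a younger instruction first does not let its data bypass an older entry that is still waiting. Writes therefore reach the register file in program order.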


3.5. Synchronization

Because all of the operations in AMPIRE are asynchronous, care must be taken

when a transaction requires more than two components to cooperate.

3.5.1. Module Level

AMPIRE contains many buffer elements to balance the different processing speeds of the various modules, and data can be transferred from one buffer to another at any time. Many operations, such as matching a memory address in the MEMDWB, involve searching all the buffers in a queue. A comparison cannot be made while a piece of data is "moving", and this is true for both synchronous and asynchronous systems.

A comparison cycle is started by requesting all buffers in a queue to suspend their transactions. After all buffers have acknowledged, another request is sent to collect the comparison results, and the interrupted transactions may then continue. This additional delay is why unnecessary search requests should be avoided, as is done in the instruction validation scheme.

3.5.2. Processor Level

Global synchronization is required for the rollback process because all modules must invalidate the erroneous data. When an error is detected, the LOG first sends a halt request to all modules except those in the instruction fetch stage (PC, IMEM, and IQ). Each module is responsible for making sure all of its data transactions are suspended before the acknowledgement is returned. After all modules have responded, the LOG sends another signal for invalidation, performing the actual rollback. The processor is restarted only when the LOG explicitly releases the halt signal, after all rollback steps are completed.

The sequence of halt and rollback events is shown in Figure 3.7. Each request signal is a single wire from the LOG routed to all modules. Each group of acknowledgement signals is fed into the inputs of a large C-element (BIGC), whose single output goes back to the LOG. Therefore, the LOG does not need to know the number of modules in the processor. Since the C-element function is associative, the BIGC can be physically distributed as smaller C-elements.

[Figure 3.7: Halt and Rollback Sequence. Waveforms of Halt_REQ, Halt_ACK, Rollback_REQ, and Rollback_ACK, annotated with the actions stop outputs, terminate handshaking, invalidate entries, load new PC, and restart.]
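This section states that the BIGC can be physically distributed as smaller C-elements. That claim can be checked with a small Python model. The sketch below (all names hypothetical) compares two things: a flat 14-input C-element, matching the port count of bigc.v, and a balanced tree of 2-input C-elements. The check assumes the 4-phase discipline, in which every acknowledgement rises once and then falls once per handshake. Under arbitrary, non-monotonic input sequences, a tree and a flat C-element need not agree.

```python
import random


class CElement:
    """2-input Muller C-element: the output copies the inputs when they agree."""

    def __init__(self) -> None:
        self.out = 0  # reset state

    def update(self, a: int, b: int) -> int:
        if a == b:
            self.out = a
        return self.out


class FlatC:
    """n-input C-element modeled directly, like bigc.v."""

    def __init__(self) -> None:
        self.out = 0

    def update(self, inputs) -> int:
        if all(inputs):
            self.out = 1
        elif not any(inputs):
            self.out = 0
        return self.out


class TreeC:
    """The same function distributed as a balanced tree of 2-input C-elements."""

    def __init__(self, n: int) -> None:
        self.root = self._build(0, n)

    def _build(self, lo: int, hi: int):
        if hi - lo == 1:
            return lo  # a leaf is just an input index
        mid = (lo + hi) // 2
        return (CElement(), self._build(lo, mid), self._build(mid, hi))

    def _eval(self, node, inputs) -> int:
        if isinstance(node, int):
            return inputs[node]
        c, left, right = node
        return c.update(self._eval(left, inputs), self._eval(right, inputs))

    def update(self, inputs) -> int:
        return self._eval(self.root, inputs)


def tree_matches_flat(n: int = 14, handshakes: int = 100, seed: int = 1) -> bool:
    """Raise all acknowledgements in random order, then lower them,
    comparing the two outputs after every single transition."""
    rng = random.Random(seed)
    flat, tree = FlatC(), TreeC(n)
    acks = [0] * n
    for _ in range(handshakes):
        for level in (1, 0):
            order = list(range(n))
            rng.shuffle(order)
            for i in order:
                acks[i] = level
                if flat.update(acks) != tree.update(acks):
                    return False
    return True


print(tree_matches_flat())
```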


Chapter Four: Behavioral Modeling

In order to verify that the AMPIRE design is architecturally correct, a behavioral

model of the processor has been built with the Cadence Verilog hardware description

language. Selected modules and submodules have also been designed and simulated at the gate level to show how they may be implemented in VLSI; they are discussed in Chapter 6.

The complete Verilog code can be found in Appendix A. Verilog resembles

procedural programming languages such as C and Pascal, with extensions for hardware

simulation. Some frequently used statements and expressions are listed in Table 4.1 for

reference. For detailed information, please see [Cade91].

Keyword     Explanation
#           Delay execution for a specified time.
@           Delay execution until an event occurs (edge sensitive).
always      Statement is executed repeatedly until simulation is terminated.
disable     Abruptly terminate a block of statements.
fork-join   Statements within the structure are executed in parallel.
initial     Statement is executed once when simulation is started.
reg         A register variable that holds a value.
wait        Delay execution until an expression becomes true (level sensitive).
wire        A net for signal declaration and connection.

Table 4.1: Frequently Used Verilog Keywords


4.1. Building Blocks

The basic code elements and techniques which are widely used in the various

processor modules will be presented in this section.

4.1.1. Tri-State Bus

An output can be tri-stated by assigning the high-impedance value z to the output variable. The syntax for conditional assignment is:

    variable = expression ? value_if_true : value_if_false;

Therefore, a tri-state bus connection (assuming a 33-bit bus) can be specified with a Verilog continuous assignment as:

    assign bus_variable = enable_signal ? output_value : 33'bz;

This way, multiple sources can be connected to the same bus, with the restriction that at most one may be enabled at any time.
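The at-most-one-driver rule can be made explicit with a tiny Python model of bus resolution. This is only an illustration. None stands in for the high-impedance value z. The model raises an error on contention, whereas a Verilog simulator would instead produce x on the conflicting bits.

```python
Z = None  # stands in for Verilog's high-impedance value


def resolve_bus(drivers):
    """Resolve a shared bus from (enable_signal, output_value) pairs.

    Each pair plays the role of one
    `assign bus_variable = enable_signal ? output_value : 33'bz;`
    on the same bus.
    """
    active = [value for enable, value in drivers if enable]
    if len(active) > 1:
        raise ValueError("bus contention: more than one source enabled")
    return active[0] if active else Z


print(resolve_bus([(False, 0x1), (True, 0x2A), (False, 0x3)]))
```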

4.1.2. Muller C-Element

The following code simulates a 2-input C-element:

    always @(input_1 or input_2)
        if (input_1 & input_2)
            output = 1;
        else if (~input_1 & ~input_2)
            output = 0;

The first line activates the always block only when either input changes its value. The rest of the code follows the behavior of the C-element: when both inputs are high, the output goes high; when both inputs are low, the output goes low; otherwise, the output keeps its previous value. The code segment can easily be expanded to handle C-elements with more than two inputs; an example is bigc.v, which has 14 input ports. Since the C-element contains state information, its state (output) has to be initialized in the reset routine.


4.1.3. Buffer

Many buffers are used in AMPIRE to improve concurrency because each module

runs at its own speed. Buffers are also needed to store uncommitted instructions and

data. The basic buffer structure is shown below, divided into input and output sections.

    always wait (req_in & ~valid)
    begin : input_cycle
        #1;
        data_out = data_in;
        ack_in = 1;
        valid = 1;
        wait (~req_in);
        #1;
        ack_in = 0;
    end

    always wait (valid)
    begin : output_cycle
        #1;
        req_out = 1;
        wait (ack_out);
        #1;
        valid = 0;
        req_out = 0;
        wait (~ack_out);
    end

Each input and output cycle follows the 4-phase handshake convention discussed in section 2.1, and each can be considered a process executing in parallel. The input cycle is started by receiving an input request, req_in. When the data is latched, the input is acknowledged, and the output cycle is initiated by setting the valid flag. The input cycle is also guarded by valid so that a new input is not accepted until the previous data has been sent to another unit. The #1 statements simulate the delays between the input and output signals, and they allow the Verilog simulation clock to advance so that the sequence of events can be observed.
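To see how the two cycles interact, the buffer can be mimicked in Python with generators acting as concurrent processes. Each process yields the condition it is waiting on, much like a Verilog wait statement. A simple scheduler resumes any process whose condition has become true. This is a hypothetical sketch that ignores the #1 delays. The signal names follow the Verilog buffer; producer, consumer, run, and transfer are illustrative helpers, not part of AMPIRE.

```python
def producer(s, items):
    """Sender side of the input handshake (drives data_in and req_in)."""
    for item in items:
        s["data_in"] = item
        s["req_in"] = 1
        yield lambda s: s["ack_in"] == 1
        s["req_in"] = 0
        yield lambda s: s["ack_in"] == 0


def input_cycle(s):
    while True:
        yield lambda s: s["req_in"] == 1 and s["valid"] == 0
        s["data_out"] = s["data_in"]  # latch the data
        s["ack_in"] = 1
        s["valid"] = 1
        yield lambda s: s["req_in"] == 0
        s["ack_in"] = 0


def output_cycle(s):
    while True:
        yield lambda s: s["valid"] == 1
        s["req_out"] = 1
        yield lambda s: s["ack_out"] == 1
        s["valid"] = 0
        s["req_out"] = 0
        yield lambda s: s["ack_out"] == 0


def consumer(s, received):
    """Receiver side of the output handshake (drives ack_out)."""
    while True:
        yield lambda s: s["req_out"] == 1
        received.append(s["data_out"])
        s["ack_out"] = 1
        yield lambda s: s["req_out"] == 0
        s["ack_out"] = 0


def run(processes, s):
    """Resume any process whose wait condition holds, until none can move."""
    waiting = {}
    for p in processes:
        try:
            waiting[p] = next(p)
        except StopIteration:
            pass
    progressed = True
    while progressed:
        progressed = False
        for p, condition in list(waiting.items()):
            if condition(s):
                progressed = True
                try:
                    waiting[p] = next(p)
                except StopIteration:
                    del waiting[p]


def transfer(items):
    s = dict(data_in=None, req_in=0, ack_in=0, data_out=None,
             valid=0, req_out=0, ack_out=0)
    received = []
    run([producer(s, items), input_cycle(s), output_cycle(s),
         consumer(s, received)], s)
    return received


print(transfer([10, 20, 30]))
```

Because the input cycle waits for valid to be cleared, each item is delivered exactly once and in order, even though the four processes are scheduled independently.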

A simple asynchronous queue can be built by stacking the buffer elements. A piece of data added to the queue ripples toward the other end until it reaches another data entry. The queues used in AMPIRE require additional controls because they must support rollback and other functions; they are discussed in separate sections. As with the C-element, the internal variable valid and the external output signals have to be initialized when the processor is reset.
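The rippling behavior of stacked buffers can be pictured with a coarse Python model. Each slot is either empty (None) or holds an item, and an item advances whenever the slot ahead of it is empty. The model abstracts away the individual handshakes, and the function names are hypothetical.

```python
def settle(slots):
    """Let every item ripple toward the output end (last index) until it
    reaches the output or another item.  Returns a new list."""
    slots = list(slots)
    moved = True
    while moved:
        moved = False
        for i in range(len(slots) - 1):
            if slots[i] is not None and slots[i + 1] is None:
                slots[i], slots[i + 1] = None, slots[i]
                moved = True
    return slots


def push(slots, item):
    """Insert at the input end (index 0), which must be empty, then settle."""
    if slots[0] is not None:
        raise RuntimeError("queue full at the input end")
    return settle([item] + list(slots[1:]))


def pop(slots):
    """Remove the item at the output end, then let the rest ripple forward."""
    item = slots[-1]
    return item, settle(list(slots[:-1]) + [None])
```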

4.1.4. Reset

As in most digital systems, some state elements have to be initialized when the system is reset. The following reset routine provides the necessary initialization steps for both the C-element and the buffer discussed in the previous two sections:

    always wait (reset)
    begin
        disable input_cycle;
        disable output_cycle;
        valid = 0;
        ack_in = 0;
        req_out = 0;
        output = 0;      // for C-element
        wait (~reset);
    end

Disable is a convenient way for an interrupt handler, such as this reset routine, to

terminate other concurrent procedures. These disable commands are needed because a

reset may occur in the middle of a data transaction, not just when "powered-on". Since

the various modules may receive the reset signal at different times, and the time

required to complete the reset process may be different, the disabled routines must not

continue until the reset signal is withdrawn. This can be done by adding reset as one

of the enabling conditions:

    always wait (req_in & ~valid & ~reset)
    always wait (valid & ~reset)

Since no handshake is involved in the reset process, the reset signal must be applied long enough for all modules to recognize it and complete the initialization. For convenience, all reset routines in the behavioral model are executed in zero simulation time. However, gate-level modules do have minimum reset pulse-width requirements, which will be discussed in Chapter 6.

4.1.5. Arbiter

An arbiter is needed whenever two or more devices may want to access the same

resource at the same time, which can happen because of the asynchronous nature of

AMPIRE. As described before, the two bus arbiters ARBK and ARBR control K_bus

and R_bus, respectively. Some modules, such as LOG and MEMDWB, also require

internal arbiters for sequencing control.

Arbiters can be divided into two categories: prioritized and non-prioritized. A non-prioritized arbiter is fair and operates in a round-robin fashion. It can be represented by the following code:

    always wait (req1 | req2 | req3)
    begin
        found = 0;
        while (~found)
        begin
            grant = grant + 1;
            if (grant >= 4)
                grant = 1;
            case (grant)
                1: if (req1) found = 1;
                2: if (req2) found = 1;
                3: if (req3) found = 1;
            endcase
        end
        [process grant handshake]
    end

If there are only two request lines, the code can be simplified, as shown in the arbr.v listing in Appendix A. Sometimes a prioritized arbiter should be used because of system requirements, or simply for better performance, as in log.v and memdwb.v. Simple nested if-else statements can describe its behavior:

    always wait (req1 | req2 | req3)
    begin
        if (req1)
            [process request 1]
        else if (req2)
            [process request 2]
        else
            [process request 3]
    end
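The difference between the two arbiters can be seen in a Python sketch of their grant decisions, with the handshakes omitted. The round-robin class follows the while-loop above, generalized to n requesters. Its reset value of grant = 0 is an assumption, since the initialization is not shown. The class and function names are hypothetical.

```python
class RoundRobinArbiter:
    """Fair arbiter: the search starts just after the last requester granted."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.grant = 0  # assumed reset value, so requester 1 is checked first

    def arbitrate(self, reqs):
        """reqs[0] is req1, reqs[1] is req2, ...; returns the granted number."""
        if not any(reqs):
            return None  # the Verilog block would not be activated at all
        while True:
            self.grant += 1
            if self.grant > self.n:  # same as `grant >= 4` when n = 3
                self.grant = 1
            if reqs[self.grant - 1]:
                return self.grant


def priority_arbiter(reqs):
    """Prioritized arbiter: the lowest-numbered active request always wins."""
    for number, req in enumerate(reqs, start=1):
        if req:
            return number
    return None
```

With all three requests held high, the round-robin arbiter cycles through 1, 2, 3, 1, ... A prioritized arbiter keeps granting requester 1, which is why it is used only where the system can tolerate starvation of the lower-priority requests or needs that ordering.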

4.1.6. Rollback

The rollback scheme was introduced in the last chapter, but some details were left out because the behavioral code gives more insight into how the rollback works. The following is the basic structure:

    always wait (halt_req)
    begin : rollback_cycle
        #1;
        halt_ack = 1;
        wait (rollback_req);
        #1;
        disable input_cycle;
        disable output_cycle;
        ack_in = 0;
        req_out = 0;
        diff = sequence_num - sequence_error;
        if (~diff[high_bit])
            valid = 0;
        rollback_ack = 1;
        [finish halt and rollback transactions]
    end

The rollback cycle follows the sequence of events shown in Figure 3.7. Each module (and submodule) that needs to roll back must provide its own rollback handler that supports this protocol. After the halt request is received, the current data transactions are stopped by disabling the handshake outputs. The halt acknowledgement is sent without any feedback from the I/O routines or circuitry because local signal propagation can be easily controlled. The actual instruction invalidation is not performed until the rollback request is given, which happens only after all modules in the processor have acknowledged the halt request issued by the instruction log.

Whether a data entry should be deleted is determined by the sequence number

comparison, as discussed in sections 3.4.1 and 3.4.4. If the high bit of the result is zero,

then the entry is invalidated. After the rollback of this module is completed, an

acknowledgement is sent, and the rest of the code just finishes the 4-phase halt and

rollback handshake cycles.

Suspending the input and output cycles as part of the rollback process can be accomplished by adding wait (~halt_req) statements before each group of output commands, as shown below:

    always wait (req_in & ~valid & ~halt_req)
    begin : input_cycle
        #1;
        wait (~halt_req);
        data_out = data_in;
        ack_in = 1;
        valid = 1;
        wait (~req_in);
        #1;
        wait (~halt_req);
        ack_in = 0;
    end

    always wait (valid & ~halt_req)
    begin : output_cycle
        #1;
        wait (~halt_req);
        req_out = 1;
        wait (ack_out);
        #1;
        valid = 0;
        wait (~halt_req);
        req_out = 0;
        wait (~ack_out);
    end

As can be seen in the code, or in the 4-phase handshake diagram shown in Figure 2.2, the input cycle alternates between waiting for a request and sending an acknowledgement, and vice versa for the output cycle. Once the output cycle issues a request, it cannot drop the request abruptly without first receiving an acknowledgement, or the handshake protocol would be violated. The only exceptions are (1) when both parties of the transaction are notified before the cancellation, as in the case of rollback, or (2) when a local or global reset occurs, as for the instruction fetch modules (PC, IMEM, and IQ) during a branch.

The rollback process deletes all the data associated with the erroneous and more recent instructions. This is straightforward because every data entry is tagged with a sequence number. However, care must also be taken to ensure that the processor is in a correct state for the data not invalidated in the rollback. Specifically, there must not be any deadlock, data loss, or data duplication. The three cases are discussed below:

There is no deadlock because all transactions are cleared during rollback and

restarted afterwards if necessary. Therefore, no module is left in the middle of a

handshake cycle waiting for a signal that will never come.

There is no data loss because all data transfers follow the sequence shown below. The possibility of duplication in the second phase is eliminated by the next step.

                Phase 1    Phase 2    Phase 3
    Sender:     valid      valid      invalid
    Receiver:   invalid    valid      valid

Figure 4.1: Data Transfer Sequence

At the receiving end, the valid bit is set when the acknowledgement is sent, as can be seen in the code for the input cycle. At the sender, the valid bit is reset when the acknowledgement is received. To avoid duplication, the sender must wait for and respond to the acknowledgement signal after the transfer request is sent. This is consistent with the 4-phase handshake protocol. Note that there is no wait (~halt_req) blocking the statements between req_out = 1 and valid = 0 in the output cycle. If the receiver detects the halt request before the acknowledgement is sent, the sender still has the valid bit set, and the transaction will be restarted after the rollback cycle if the data is not invalidated. Otherwise, the transfer is treated as completed, even though the 4-phase handshake protocol is not used to release the request and acknowledgement signals.
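The no-loss, no-duplication argument can be checked exhaustively with a deliberately abstract Python model of a single transfer. The model ignores timing and the release phase of the handshake. The halt is assumed to arrive just before one of three steps. Once halted, gated steps (those behind a wait (~halt_req)) no longer run, while ungated steps still do. The function and parameter names are hypothetical.

```python
def copies_after_halt(halt_at: int, gate_sender_response: bool = False) -> int:
    """Count the valid copies of one datum after a halt arrives before step `halt_at`.

    Steps of one transfer (the sender already holds the data, valid = 1):
      0: sender raises req_out                          (gated by ~halt_req)
      1: receiver latches data, sets valid, raises ack  (gated by ~halt_req)
      2: sender sees ack and clears its valid bit       (not gated in AMPIRE)
    halt_at = 3 means the halt arrives after the transfer has finished.
    """
    sender_valid, receiver_valid = True, False
    req = ack = False
    gated = (True, True, gate_sender_response)
    for step in range(3):
        if step >= halt_at and gated[step]:
            continue  # blocked by the halt request
        if step == 0:
            req = True
        elif step == 1 and req:
            receiver_valid, ack = True, True
        elif step == 2 and ack:
            sender_valid = False
    return int(sender_valid) + int(receiver_valid)


for h in range(4):
    print(h, copies_after_halt(h), copies_after_halt(h, gate_sender_response=True))
```

With the sender's response left ungated, exactly one valid copy exists for every halt point. If that step were also blocked by the halt, a halt arriving right after the receiver's acknowledgement would leave two valid copies, which is the duplication the text rules out.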

4.2. Processor Modules

Now that the basic elements have been described, the complete modules are discussed by functional group. Only code segments are shown here, as needed; see Appendix A for the full Verilog model.

4.2.1. Queues

There are two dedicated queues: the instruction queue (IQ) and the data queue (DQ). Other internal queues are very similar. The IQ contains only two buffers, but it must also support a local reset to invalidate pre-fetched instructions whenever a branch occurs. The cancellation cycle is similar to the reset routine, except that the cancellation is performed with the 4-phase handshake protocol to notify the IIU that invalidation is finished. A C-element combines the two cancellation acknowledgement signals from the buffers, as can be seen at line 35 of iq.v. When a rollback occurs, the IQ is simply cleared with the same signal, so a separate rollback routine is not needed.

The DQ is a combination of a queue and a bus-access handler. The queue buffer includes a rollback routine to selectively invalidate data entries. The bus handler sends a request to the ARBR arbiter and then proceeds with the actual transfer after the grant is given. The bus handler also needs to support rollback because data is latched before the bus cycle is started.

4.2.2. Checkers

All four checkers are functionally and structurally identical; they differ only in the number of data bits being checked and in their identification numbers. Each checker has an internal queue with a depth of two, and all components of a checker must support rollback. Errors are detected by XORing all the bits in a word along with its parity bit, as in the following code:

    parity = 0;
    for (loop = 0; loop < 1 + DATA_WIDTH; loop = loop + 1)
        parity = parity ^ data_buf[loop];

Because even parity is used, a resulting parity of one indicates that an error has occurred. For checkers with two data words, the two parity bits are generated individually. After a bus grant is given by ARBK, the checker sends the result to the LOG for validation or invalidation.
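The checker's XOR loop can be reproduced in Python to show which faults even parity catches. The sketch assumes a 32-bit data word with the parity bit stored just above it, at bit DATA_WIDTH, so that data_buf has 1 + DATA_WIDTH bits as in the loop bound. The names encode and has_error are hypothetical.

```python
DATA_WIDTH = 32  # assumed: 32 data bits plus one parity bit on a 33-bit bus


def parity_of(word: int, width: int) -> int:
    """XOR of the low `width` bits of `word`."""
    p = 0
    for i in range(width):
        p ^= (word >> i) & 1
    return p


def encode(data: int) -> int:
    """Attach an even-parity bit above the data bits (bit position assumed)."""
    return data | (parity_of(data, DATA_WIDTH) << DATA_WIDTH)


def has_error(data_buf: int) -> bool:
    # XOR all 1 + DATA_WIDTH bits, as the checker's for-loop does
    return parity_of(data_buf, 1 + DATA_WIDTH) == 1
```

Any single flipped bit, including the parity bit itself, makes the XOR come out as one. Two flipped bits cancel each other and go undetected, which is the usual limitation of a single parity bit.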

4.2.3. Memories

The three memory modules, IMEM, DMEM, and REGFILE, all use the Verilog memory structure. The IMEM is read-only and pipelined so that the PC can start its next cycle when the instruction is sent to the IQ. The pipelined handshake is handled here by a fork-join structure rather than by separate input and output cycles as in the buffer:

    always wait (in_req)
    begin
        data = imemory[addr[ADDR_WIDTH-1:ADDR_IGNORE]];
        fork
            begin
                in_ack = 1;
                wait (~in_req);
                in_ack = 0;
            end
            begin
                out_req = 1;
                wait (out_ack);
                out_req = 0;
                wait (~out_ack);
            end
        join
    end

Because the unit of address is the byte but the memory is organized as 32-bit words, the ADDR_IGNORE parameter is used to discard the two low-order address bits.
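In Python terms, the part-select addr[ADDR_WIDTH-1:ADDR_IGNORE] is a mask followed by a right shift. The sketch assumes ADDR_IGNORE = 2, as the text implies. The 32-bit ADDR_WIDTH is chosen only for illustration.

```python
ADDR_WIDTH = 32   # assumed for illustration
ADDR_IGNORE = 2   # byte address -> 32-bit word index


def word_index(addr: int) -> int:
    """Equivalent of addr[ADDR_WIDTH-1:ADDR_IGNORE] for a byte address."""
    return (addr & ((1 << ADDR_WIDTH) - 1)) >> ADDR_IGNORE
```

Byte addresses 0 through 3 all select word 0, and address 4 selects word 1.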

DMEM is not pipelined in the read mode because the destination register and the

sequence number from MEMDWB are routed directly to DQ and not through DMEM.

This way, the data from MEMDWB are not released until they are latched in DQ, along

with the data read from DMEM. The following is the non-pipelined handshake, in

contrast to the pipelined version used by IMEM:

    always wait (in_req)
    begin
        data_out = dmemory[addr[ADDR_WIDTH-1:ADDR_IGNORE]];
        out_req = 1;
        wait (out_ack);
        in_ack = 1;
        out_req = ZZZ;   // release request (tri-state)
        wait (~in_req);
        in_ack = 0;
        wait (~out_ack);
    end

The input is acknowledged after the output acknowledgement is received. Note that

the request signal is released by tri-stating it because it is shared with MEMDWB. That

line is pulled down by an external source.


Since the IMEM is read-only, the program is loaded from the instruction file only

when the simulation is started, in an initial block. The same file is also loaded into

DMEM so that memory data defined in the program can be accessed. However, the

loading is done in the reset routine because DMEM has to be initialized whenever a reset occurs. Both IMEM and DMEM have an "error correction" capability that recalculates the parity bit after a rollback, to simulate transient memory faults. The correction is triggered by the retry flag from the IIU.

While IMEM and DMEM are single-ported memories, the REGFILE has three ports, supporting two reads and one write simultaneously. Although register R0 is a zero constant, operations that write to it are carried through as usual, but reading that register always returns zero, as determined by these two statements:

    rd_data1 = rd_reg1 ? rfile[rd_reg1] : 0;
    rd_data2 = rd_reg2 ? rfile[rd_reg2] : 0;

4.2.4. Program Counter

The PC module is a loadable incrementing counter. During normal operations, the

counter is incremented by four after each instruction fetch, and a new parity bit is

generated. When a branch occurs, the load cycle interrupts the increment cycle and

obtains the new address from IIU.
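The PC's behavior can be sketched in Python as follows. The class name, the 32-bit width, and the helper even_parity are assumptions. The even-parity convention matches the checkers described in section 4.2.2. In AMPIRE, the new address arrives from the IIU through a handshake; the sketch collapses that into a method call and simply recomputes parity on load.

```python
WIDTH = 32  # assumed address width


def even_parity(value: int) -> int:
    """Parity bit that makes the total number of 1s (value plus parity) even."""
    return bin(value).count("1") & 1


class ProgramCounter:
    def __init__(self, start: int = 0) -> None:
        self.load(start)

    def increment(self) -> None:
        # normal operation: advance by one 4-byte instruction
        self.pc = (self.pc + 4) & ((1 << WIDTH) - 1)
        self.parity = even_parity(self.pc)

    def load(self, address: int) -> None:
        # branch, jump, or rollback: take a new address from the IIU
        self.pc = address & ((1 << WIDTH) - 1)
        self.parity = even_parity(self.pc)
```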

4.2.5. Arithmetic Logic Unit

The ALU itself is very simple because Verilog operators are used directly for all arithmetic and logic functions except SRA (shift right arithmetic), which is implemented by:

    out_data = data1;
    for (loop = 0; loop < (data2 % DATA_WIDTH); loop = loop + 1)
        out_data = {out_data[DATA_WIDTH-1], out_data[DATA_WIDTH-1:1]};

The whole word is shifted to the right one bit per iteration, while the highest bit, or sign bit, is replicated. After each computation, a new parity bit is generated to accompany the data to its destination. For fault simulation, if the requested function is ADDUF (add unsigned with fault), then the parity bit is initialized to 1, which results in an incorrect parity bit:

    parity = (func == FUNC_ADDUF);

The rest of the ALU module contains an input queue and routines for handshake and

rollback.
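The loop-based SRA and the ADDUF fault injection can be mirrored in Python to check them against an ordinary arithmetic shift. DATA_WIDTH = 32 and the function names are assumptions. Words are represented as unsigned integers, as in the Verilog model.

```python
DATA_WIDTH = 32
MASK = (1 << DATA_WIDTH) - 1


def sra(data1: int, data2: int) -> int:
    """Shift right arithmetic, one bit per iteration, replicating the sign bit."""
    out = data1 & MASK
    sign = out >> (DATA_WIDTH - 1)
    for _ in range(data2 % DATA_WIDTH):
        out = (sign << (DATA_WIDTH - 1)) | (out >> 1)
    return out


def to_signed(word: int) -> int:
    """Interpret an unsigned DATA_WIDTH-bit word as two's complement."""
    return word - (1 << DATA_WIDTH) if word >> (DATA_WIDTH - 1) else word


def alu_parity(result: int, faulty: bool) -> int:
    """Even parity of the result; starting from 1 (as for ADDUF) inverts it."""
    parity = 1 if faulty else 0
    for i in range(DATA_WIDTH):
        parity ^= (result >> i) & 1
    return parity
```

Python's >> on a negative integer is already an arithmetic shift, so to_signed(sra(w, n)) should equal to_signed(w) >> (n % 32) for any word w.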

4.2.6. Controller

4.2.6.1. Instruction Issuing Unit

The AMPIRE controller consists of the instruction issuing unit (IIU) and the reservation table (RESTABLE). In addition to performing the steps between instruction fetch and execution, the IIU also handles branches and jumps. Within the IIU, the general sequence is: receive the instruction, decode, log, notify CKI, check reservation, read registers, notify CKF, and send a request to an execution unit.

The "fetch cycle" actually just accepts the instruction from the IQ. The instruction and its address from the internal PC are latched in the two bus output buffers before the internal PC is incremented. Then the parity bit for the new PC is calculated, and the go_decode flag is set to start the decoding routine.

One of the steps in the decoding cycle is determining the check vector so that the instruction can be logged. This is done with the case structure at line 163 of iiu.v. As discussed in section 3.4.3, the instruction must be accepted by the LOG before it can be sent to the CKI checker. Therefore, the IIU waits for the log acknowledgement at line 186 before starting the CKI handshake in the fork-join block.

Depending on the opcode, different tasks, or subroutines, are called to perform the appropriate functions. All register-to-register ALU operations are assigned a single opcode number (0), and they have to be decoded further using the function field, at line 244. Note that NOP (binary 0) is also decoded in this group because its opcode field is the same as the special opcode. All ALU instructions are processed by the ALU control task, which reserves and retrieves the necessary registers. Before these two steps can be taken, the three output variables reg_rs, reg_rt, and reg_rd must be set up properly with data from the instruction word. Any unused field is assigned zero, since R0 is always available for the reservation process. The two source operands are latched in the bus output buffers by these two statements:

    abus_out = abus_in;
    bbus_out = (opcode == OP_SPECIAL) ? bbus_in :
               {{(DATA_WIDTH-IMM_WIDTH){imm[IMM_WIDTH-1]}}, imm};

abus_in and bbus_in are the bus input registers containing the values read from the register file. The first operand (abus_out) is always a register value, as determined by the instruction set. The second operand (bbus_out) is either a register value or sign-extended immediate data from the instruction word; a conditional assignment based on the opcode selects between them. Finally, the ALU handler is in charge of generating the parity bits and carrying out communication with the ALU module.
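The bbus_out selection and the replication-based sign extension can be expressed in Python as below. IMM_WIDTH = 16 is an assumed DLX-style immediate width. OP_SPECIAL = 0 follows the text's statement that register-to-register ALU operations share opcode 0. The function names are hypothetical.

```python
DATA_WIDTH = 32
IMM_WIDTH = 16     # assumed DLX-style 16-bit immediate
OP_SPECIAL = 0     # register-to-register ALU opcode, as stated in the text


def sign_extend(imm: int) -> int:
    """Python equivalent of {{(DATA_WIDTH-IMM_WIDTH){imm[IMM_WIDTH-1]}}, imm}."""
    imm &= (1 << IMM_WIDTH) - 1
    if imm >> (IMM_WIDTH - 1):  # sign bit set: fill the upper bits with 1s
        imm |= ((1 << (DATA_WIDTH - IMM_WIDTH)) - 1) << IMM_WIDTH
    return imm


def second_operand(opcode: int, bbus_in: int, imm: int) -> int:
    """Selection performed by the bbus_out conditional assignment."""
    return bbus_in if opcode == OP_SPECIAL else sign_extend(imm)
```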

The two tasks for memory and branch operations are straightforward. For jumps, the JAL and JALR instructions also need to have the previous PC stored in R31. Therefore, the PC is stored in the output buffer at line 341 before the new PC is calculated. After the new internal PC is loaded into the PC module, the previous PC is sent to the ALU and then passed to the register file.

The PC handler updates the PC module and invalidates the pre-fetched instructions in the IQ. When the inst_en line is pulled low, the output buffer of the IQ is disabled, and the internal PC is gated onto the I_bus. The request signal that initiates the PC module loading cycle also cancels all current transactions in IMEM and IQ. The acknowledgement signals from all three modules have to be received by the IIU before the next handshake step is taken; they are combined by another C-element.

The rollback routine for the IIU is slightly different from the others because it has to initiate recovery after rollback. After the halt request is received, the IIU has to wait for any current PC loading cycle to finish before it can proceed, which is accomplished by line 94:

    wait (~newpc_req & ~pc_ack & ~imem_ack & ~iq_ack);

The PC handler must be allowed to finish its 4-phase handshake cycle because the same interface is used to load the address being rolled back, and the three instruction fetch modules must be left in a usable state. After all modules have rolled back, as indicated by the LOG releasing the roll_req signal, the IIU updates the sequence number and the internal PC from data stored in the LOG:

    seq_num = val_seq;
    int_pc = a_bus;

Going back to the fetch cycle at line 140, there is an "extra" enabling condition, inst_en. It is necessary to prevent the fetch cycle from being activated before a valid instruction is available from the IQ after rollback. The inst_en signal is reset in the rollback cycle and set in the PC handler, after the rollback address is loaded into the PC module.

A few fault simulation instructions have been added to test the fault-tolerant portions of the processor. However, since the processor is rolled back to the instruction where the error occurred, simply executing it again would cause the same error and result in an infinite loop. A retry flag is set in the rollback cycle to indicate that the next instruction to be executed is the second attempt. That flag is then used to "correct" the error so that rollback is not triggered again for the same instruction. The code can be found at lines 275, 306, and 352.

Another instruction used for simulation is TRAP (as in the DLX simulator), which is supported by the task at line 375. In the DLX, it is normally used to call exception-handling routines, but AMPIRE does not support interrupts or exceptions. Traps 0 to 31 cause the simulator to print the register values in decimal, hexadecimal, and binary formats. Trap BLANK (90) simply prints a blank line for output formatting. Finally, trap STOP (100) is used to signal the end of the program. No new instructions are issued afterwards, but those already being executed are allowed to finish within the time predetermined in the test setup.

Note that the sequence of executing multiple handshake cycles in the IIU is not

optimal for concurrency. For example, the reservation cycle is completed before the

register file can be accessed because these two routines are in separate tasks. In reality,

the register file request can be sent as soon as the reservation is acknowledged, and the

rest of the handshake cycle can be finished in parallel. However, doing so in the

already complex IIU module would make the code less readable and less modular.

Higher concurrency can be more easily explored in the actual implementation, or in

terms of signal transition specifications [Meng89].


4.2.6.2. Reservation Table

The need for register reservation was discussed in section 3.3.2. The reservation table uses two Verilog memory structures, one for the reservation bits and another for the sequence numbers, both addressed by the register number. When a reservation request is issued, the first step is to wait until all three reservation bits are cleared:

    wait (~res_table[reg_w] & ~res_table[reg_r1] & ~res_table[reg_r2]);

Even though any of the reservation bits may be cleared at any time, there is no problem with glitches because any change must be from 1 (reserved) to 0 (not reserved). Furthermore, only one bit can change at a time because there is only one clearance port. Then the destination register is reserved by setting its reservation bit and storing the sequence number in the corresponding slot:

    res_table[reg_w] = 1 & (reg_w != 0);
    seq_table[reg_w] = seq_num;

The first line above also ensures that R0 is never reserved, because an instruction writing to R0 has no effect on the register file. When a rollback occurs, each sequence number is compared with the error sequence number, just as in the other storage elements. A for-loop cycles through all of the slots sequentially, but a hardware implementation would use associative comparators. Invalidation is accomplished by clearing the appropriate reservation bits.
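The reservation table's operations can be collected into a short Python sketch. The 8-bit sequence width, the 32-register size, and the method names are assumptions. The rollback loop is the sequential stand-in that the text mentions, and it uses the same subtract-and-test-the-high-bit rule as the buffers.

```python
SEQ_BITS = 8  # assumed sequence-number width


class ReservationTable:
    def __init__(self, nregs: int = 32) -> None:
        self.res_table = [False] * nregs  # reservation bits
        self.seq_table = [0] * nregs      # sequence number of the reserving instruction

    def can_proceed(self, reg_w: int, reg_r1: int, reg_r2: int) -> bool:
        """The condition the Verilog wait blocks on: all three bits clear."""
        return not (self.res_table[reg_w] or self.res_table[reg_r1]
                    or self.res_table[reg_r2])

    def reserve(self, reg_w: int, seq_num: int) -> None:
        self.res_table[reg_w] = reg_w != 0  # R0 is never reserved
        self.seq_table[reg_w] = seq_num

    def clear(self, reg: int) -> None:
        """Single clearance port, driven when the result reaches the DWB."""
        self.res_table[reg] = False

    def rollback(self, error_seq: int) -> None:
        """Clear reservations made by the erroneous or more recent instructions."""
        mask = (1 << SEQ_BITS) - 1
        for reg, seq in enumerate(self.seq_table):
            diff = (seq - error_seq) & mask
            if diff >> (SEQ_BITS - 1) == 0:
                self.res_table[reg] = False
```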


4.2.7. Delayed Write Buffers

A DWB provides temporary storage for data not yet committed to its permanent target, so that if an error occurs, rollback can be performed by deleting the appropriate data from the DWB. The two DWBs are the largest modules in AMPIRE in terms of code size, but much of the code is related to submodule declarations and connections.

4.2.7.1. REGDWB

In the REGDWB controller module, there are five main functional cycles in addition to the standard reset and rollback routines: input, check, clear, commit, and read. The input cycle directs the queue to accept data and then enables the check and clear cycles, which request CKR to check the data and clear the reservation at RESTABLE, respectively. The input routine is guarded with the go_check and go_clear signals to prevent it from starting another cycle before the previous one is completed.

When the oldest entry in the queue reaches the last buffer, its output request starts the commit cycle. However, the data is not ready to be written to the register file until the wait bit has been cleared by the LOG. After that bit is cleared, a write request is sent to the register file. This step is not pipelined because data in the queue cannot be deleted until it is written to the register file; this guarantees that a read operation always has valid data to read. If the commit cycle were pipelined, the most recent data might be held in a buffer that cannot be searched, or might be in the middle of being written to the register file and not yet stable enough to be read properly. Also, the two com_ack transitions are guarded with the rd_req signal so that data cannot be dropped in the middle of a read operation.

The read request from the IIU goes directly to the REGDWB queue and REGFILE, but the two acknowledgement signals do not return to the IIU; instead, they activate the read cycle in the controller. If there is a match in the buffer, that data is gated onto the bus; otherwise, data from the register file is selected. Multiple matches in the queue are resolved by a priority circuit, which is transparent to the controller. The read cycle also supports fault simulation by inverting the parity bit if the sim_f signal from the IIU is active.

The regdwbq module connects the four buffers, using four C-elements to handle four sets of acknowledgement signals and a priority routine to deliver the correct data to the REGDWB controller. When multiple matches are found, the casex structures select the latest entries based on the match results. The two source registers are handled separately.

In an individual buffer, the validate cycle clears the appropriate wait bit when the corresponding instruction is validated by LOG. The find cycle checks whether its buffer contains the data for the register being read. Note that the match signal for R0 must be suppressed, because that register cannot be overwritten even though any instruction may try to do so. Otherwise, a non-zero value could be returned for R0, because data in the buffer has higher priority than the register file.
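The lookup performed by the find cycles and the priority circuit can be sketched in Python as follows. The list-of-pairs representation and the function name are hypothetical; the sketch captures only the selection rule: the newest uncommitted entry for the register wins, a match on R0 is ignored, and otherwise the register file supplies the value.

```python
def regdwb_read(queue, regfile, reg):
    """Return the value a read of `reg` should see.

    queue: list of (dest_reg, data) pairs for uncommitted results, oldest first.
    """
    if reg != 0:                              # R0 matches are suppressed
        for dest, data in reversed(queue):    # newest entry has priority
            if dest == reg:
                return data
    return regfile[reg]                       # no match: use the register file
```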

The input and output cycles are the main handshake routines for data transfers, similar to the ones in section 4.1.6. The major difference is that the DWB buffer has to pause while the validate and find operations are active. Both functions are associative searches, and the validate cycle also modifies a bit in the buffer. Before a comparison can be made, the input and output routines must be stopped to keep the data stationary. Therefore, each output group is guarded by this statement:

wait (~val_req & ~val_ack & ~read_req & ~halt_req);

Because of this requirement, each search may slow down the normal flow of data in the queue. This penalty was discussed in section 3.5.1.

4.2.7.2. MEMDWB

The queue for MEMDWB is basically the same as the one for REGDWB, except that register numbers are replaced by memory addresses for data tracking and searching. The controller has to be modified to handle a single-ported memory rather than a three-ported register file.

Each memory transaction can be either a read or a write, and the input cycle has to process the data accordingly. Memory writes are queued in DWB, but memory reads are not buffered, in order to reduce reservation waiting time, as explained in section 3.3.3.2. For a read operation, the read routine is enabled by setting the go_read flag, and then a check request is sent to CKM. For a write operation, the data is sent to the DWB queue.

After the read cycle is activated, the first step is to search DWB for any data destined for the same memory location. If there is a match, the data is retrieved from the buffer and sent directly to DQ, saving a memory access. If there is no match, DMEM has to be read. Since DMEM is a single-ported memory, only one read or write operation can be performed at a time. Therefore, an internal arbiter is used for mutual exclusion, with priority given to the read request for higher performance, because the destination register of a load remains reserved until the data arrives. Even though both MEMDWB and DMEM may send data to DQ, there is at most one outstanding memory read operation at any time, so no arbitration is required for access to DQ.
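The MEMDWB read path described above can be sketched as two small Python functions. Both are hypothetical illustrations: memdwb_read models the search-then-forward decision with DMEM as a dictionary, and grant models only the case in which a read request and a write request are pending at the same moment (a write that already holds DMEM is allowed to finish).

```python
def memdwb_read(addr, write_queue, dmem):
    """Return (data, source) for a load, checking queued stores first.

    write_queue: (address, data) pairs for stores not yet committed, oldest first.
    dmem: dict modeling the data memory.
    """
    for a, data in reversed(write_queue):  # newest matching store wins
        if a == addr:
            return data, "MEMDWB"           # forwarded: DMEM access saved
    return dmem.get(addr, 0), "DMEM"


def grant(read_pending, write_pending):
    """Single-ported DMEM arbiter sketch: a pending read beats a pending write."""
    if read_pending:
        return "read"
    if write_pending:
        return "write"
    return None
```

Mirroring the start of test2.a, a store of r0 to 40h still sitting in the queue means that memdwb_read(0x40, [(0x40, 0)], dmem) returns (0, "MEMDWB") without touching DMEM.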

Interactions between the read and commit cycles need some explanation. The read cycle issues a search request and then waits for arbitration. At the same time, the commit cycle may be writing to DMEM and will then acknowledge the queue to delete the committed entry. However, as explained in the previous section for REGDWB, the input and output routines must be suspended while the queue is searched. The question is whether a deadlock is possible, since the commit cycle is then unable to delete the entry from the queue. The following statements are taken from the commit_cycle in memdwb.v, though not as a continuous block:

wait (out_ack);
com_ack = 1;
out_req = 0;
arbwr_req = 0;
wait (~com_req);

After DMEM has completed the write cycle (out_ack becomes 1), the acknowledgement com_ack is sent back to the queue for deletion. At this time, both the DMEM and arbitration requests are released, so the read cycle can continue once the arbiter grants it access. The commit routine cannot finish its cycle until the queue search is completed, but that does not interfere with the read operation.

4.2.8. Instruction Log

The instruction log stores every instruction issued by the IIU until it is completely checked and validated. If an error occurs, the LOG contains the information necessary to roll back the processor and re-execute the offending instruction. The LOG can be split into a controller and the actual storage log, which is organized as a queue. The IIU logs an instruction by communicating directly with the log, not through the controller.

The controller has two main functions: validation and rollback. Each checker reports to LOG with the checker ID, the sequence number, and a pass/fail signal. If there is no error, the appropriate check bit in the log is cleared. If the check fails, a rollback cycle is activated, using the sequence discussed earlier. Each module that needs to roll back must monitor these signals and respond accordingly. An arbiter is needed because the sequence number output port is shared between the validation and rollback routines.

When the check bits of the oldest entry in the log are all cleared, that instruction can be validated. Instruction validation involves notifying one of the DWBs and deleting the entry from the log. Data from the check vector, whose format is shown in Figure 3.4, is used to determine which DWB holds the waiting data. Since there are two DWBs in AMPIRE, the CKR and CKM check bits are also stored separately as DWB bits, which are not reset when the check bits are cleared. The four possible combinations are shown in Table 4.2.

<CKR,CKM>   Target DWB   Instruction Examples
00          None         NOP, Conditional Branches
01          MEMDWB       SW (store word)
10          REGDWB       LHI, All ALU Instructions
11          REGDWB       LW (load word)

Table 4.2: DWB Bits in the Instruction Log

Cases 01 and 10 should be obvious. The LW instruction has to go through both checkers, CKM and CKR, but its data must be in REGDWB when the instruction is validated. By sending the validation signal to only one DWB, as indicated by the DWB bits, unnecessary queue searches are avoided. If the DWB bits are 00, the instruction does not change any permanent storage module, and it can simply be deleted from the log.
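The validation step can be sketched as a lookup on the DWB bits of Table 4.2 plus the rule that only the oldest log entry may be validated. The dictionary layout of a log entry and the function name are hypothetical, and checkers are assumed to remove their names from an entry's check set as they report success.

```python
# Target DWB selected by the <CKR,CKM> DWB bits (Table 4.2).
DWB_TARGET = {
    (0, 0): None,      # NOP, conditional branches: nothing to commit
    (0, 1): "MEMDWB",  # SW
    (1, 0): "REGDWB",  # LHI, ALU instructions
    (1, 1): "REGDWB",  # LW: checked by CKM and CKR, data ends up in REGDWB
}


def validate_oldest(log):
    """Validate the oldest entry if all its check bits are clear.

    log: list of dicts with keys 'seq', 'check' (set of pending checkers)
    and 'dwb' (the <CKR,CKM> tuple), oldest first.
    Returns (seq, target_dwb) for a validated entry, otherwise None.
    """
    if not log or log[0]["check"]:
        return None              # oldest entry still has pending checks
    entry = log.pop(0)           # delete the entry from the log
    return entry["seq"], DWB_TARGET[entry["dwb"]]
```

Validation stays in issue order: a younger entry whose check set is already empty still waits until every older entry has been validated.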

When the processor is rolled back, the instruction address matching the sequence number of the failed instruction is retrieved from the log and sent to IIU via A_bus. This is done in the rollback cycle of the log buffer, at line 354 of log.v:

if (valid & (out_s == clr_s))
    tri_addr = out_d;

Multiple matches cannot occur, because each valid log entry has a unique sequence number within the current "window". The tri_addr register is returned to the high-impedance state after the rollback cycle.
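The rollback lookup in log.v can be restated in Python for clarity. The (valid, seq, addr) tuple layout and the function name are assumptions; returning None corresponds to leaving tri_addr in the high-impedance state, and the explicit uniqueness check turns the "window" argument above into something testable.

```python
def rollback_address(log_entries, err_seq):
    """Return the instruction address logged under err_seq, or None.

    log_entries: iterable of (valid, seq, addr) tuples.
    """
    matches = [addr for valid, seq, addr in log_entries
               if valid and seq == err_seq]
    if len(matches) > 1:
        raise ValueError("sequence numbers must be unique within the window")
    return matches[0] if matches else None
```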

4.3. Putting the Processor Together

4.3.1. Parameters and Timing

The parameter file declares the constants used by all processor modules, including the opcode assignments, instruction field widths, and module delay factors. The DLX instruction opcodes are taken from the documentation in [Host91] so that the same instructions have identical binary codes. Most importantly, the behavior of the processor can be modified by adjusting the delay parameters. For example, the order of instruction completion can be changed by increasing the access time of the DMEM module, as the simulation results in the next chapter will show.

Whether the processor is delay-insensitive or just self-timed depends on the actual implementation at the transistor level. The behavioral model of AMPIRE is self-timed in the sense that the delay values in all modules may be changed without affecting the processor's functionality and correctness, but it is not delay-insensitive, because single-rail data and requests are used, as discussed in section 2.1. Delays cannot be placed arbitrarily, however, because the logical grouping of some statements must be preserved. Most delays are set to 1 to minimize idle simulation time while still allowing Verilog simulation time to advance.

4.3.2. Top-Level Wiring and Testing Module

The file ampire.v connects all processor modules together with a large number of wires. The buses are declared as trireg to model wire capacitance, keeping each signal at its last driven value when all drivers are in the high-impedance state. This only reduces the number of signal transitions and is not required to satisfy any timing conditions in AMPIRE. Passive pulldowns are used to keep the shared handshake lines from floating, because every handshake transition is significant in a real asynchronous system, and noise should be minimized.

The test setup routine appears toward the end of the ampire.v listing. A large initial block organizes the signals for generating the simulation waveforms shown in the next chapter. The other initial block starts the processor by issuing a global reset. After that, the processor runs at its own pace until the stop instruction is encountered or the preset simulation time is exceeded. The #100 delay before stopping Verilog allows instructions still in the processor to complete. The large delay value for abnormal termination is mainly used to catch infinite loops caused by software and "hardware" bugs.

Chapter Five
Behavioral Simulation

5.1. Instruction Fetch/Issue and Queue Operations

In this chapter, several test programs and their simulation waveforms are presented. Figures 5.1 and 5.2 show the effects of running the following program with slow and fast instruction memories, respectively. Figure 5.3 displays only the first 85 time steps of Figure 5.2 so that the sequence of events can be observed easily.

// ; test1.a --
// ; jumps, IF stage, reservation, ALU & REGDWB queues
//
// ; DLY_ALU_ADD=20, DLY_ALU_OR=30, DLY_ALU_XOR=10
// ; DLY_ALU_PASS=5, DLY_IMEM_RD=1 and 10
//
000000000 // 00 nop
00C000018 // 04 jal sub
//
034000001 // 08 ori r0, r0, 1 ; write R0
03401000F // 0C ori r1, r0, 15
038020005 // 10 xori r2, r0, 5
000221821 // 14 addu r3, r1, r2 ; read after write
050230004 // 18 slli r3, r1, 4 ; write after write
144000064 // 1C trap 100 ; stop
//
14BE00000 // 20 sub: jr r31 ; return

The signal im_req is the request from PC to IMEM to start a memory access cycle, and im_ack is the corresponding acknowledgement. The iq_req signal from IMEM notifies IQ that an instruction has been retrieved, and IQ then sends iiu_req to IIU when the instruction is ready at the output of the queue. The pc_load signal is active when a branch is taken, and the two pc_load pulses represent the two jumps in the test program. See Figure 3.2 for the direction of data flow.

Since the instruction fetch cycle continues until IQ fills up or a branch occurs, the number of pre-fetched instructions depends on the speed and depth of IQ and on the access time of IMEM. Looking at the im_ack signal before the first pc_load pulse, three memory cycles complete in Figure 5.1, compared with five in Figure 5.2. Furthermore, at point A in Figure 5.3, the response time of im_ack is significantly longer than in the previous four cycles. At that time, IQ is full and cannot accept more data until an entry is removed from its output at point B.

[Waveform plot: test1a1 with DLY_IMEM_RD=10, time scale 0 to 346. Signals: pc_load, pc_out, im_req, im_ack, im_inst, iq_req, iq_ack, I_bus, iiu_req, iiu_ack, seq_num.]
Figure 5.1: Test 1 with Slow IMEM

[Waveform plot: test1a2 with DLY_IMEM_RD=1, full time scale 0 to 320. Signals: pc_load, pc_out, im_req, im_ack, im_inst, iq_req, iq_ack, I_bus, iiu_req, iiu_ack, seq_num.]
Figure 5.2: Test 1 with Fast IMEM

[Waveform plot: test1a3 with DLY_IMEM_RD=1, close-up time scale 0 to 85. Signals: pc_load, pc_out, im_req, im_ack, im_inst, iq_req, iq_ack, I_bus, iiu_req, iiu_ack, seq_num. Marked points: A, B, C.]
Figure 5.3: Test 1 with Fast IMEM, Close-Up

As discussed in section 2.1, the 4-phase handshake protocol requires that an activated request signal stay high until its acknowledgement is received. Checking these waveforms reveals many apparent violations; point C in Figure 5.3 is an example. However, these are caused by local resets of the IF stage initiated by pc_load and, therefore, are not real violations of the protocol.
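To make the rule concrete, the following hypothetical checker scans sampled waveforms for a request that falls before its acknowledgement arrives, while ignoring samples flagged as local resets (such as those initiated by pc_load). The trace format is an assumption, and only the request-holding rule quoted above is checked, not the full 4-phase sequence.

```python
def four_phase_violations(trace):
    """Return the times at which req drops while ack is still low.

    trace: list of (time, req, ack, reset) samples in time order, where reset
    is True for samples taken during a local reset.
    """
    violations = []
    prev_req = 0
    for time, req, ack, reset in trace:
        # A falling request is legal only after its acknowledgement.
        if prev_req == 1 and req == 0 and ack == 0 and not reset:
            violations.append(time)
        prev_req = req
    return violations
```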

Figure 5.4 shows the simulation result of the same test program, displaying signals from other parts of the processor. The JAL instruction needs to store the current PC in R31, and that register is reserved at point A. The next instruction, JR R31, checks for register availability at point B, but it is held until R31 is cleared at point J. Note that R31 is not written to the register file until point N, after the JAL instruction is validated at point R. This means that R31 is read from REGDWB instead of the actual register file.

The two ORI instructions demonstrate that "writing" R0 does not block other instructions from using it, since it is a zero constant. The reservation of R0 by the first ORI is not cleared until point K, but the reservation clearance for the second ORI is already completed at point C. It may be argued that the register clearance at point K and the write operation at point O are unnecessary, since R0 cannot be written anyway. However, eliminating these steps would require at least an additional zero detector to handle this special case: another performance and cost trade-off.

The waveforms at points (D,L) and (E,M) indicate the read-after-write and write-after-write dependencies, respectively. Instructions are held by the reservation table until the required registers are available. The signals at points (F,H) and (G,I) show the input-output timing of the ALU. The ALU is able to accept a new request (pulse G) before finishing its current operation (pulse H) because of its internal buffer. Similarly, (H,P) and (I,Q) are the input-output pairs for REGDWB, indicating queuing inside that block.

[Waveform plot: test1b with DLY_IMEM_RD=1, time scale 0 to 320. Signals: pc_load, pc_out, im_req, im_inst, iq_req, I_bus, iiu_req, iiu_ack, seq_num, reg_rs, reg_rt, reg_rd, log_req, cki_req, res_req, rf_req, func, alu_req, R_seq, R_data, R_dest, alu_arb, clr_req, wrf_reg, wrf_req, K_seq, K_chkid, val_seq, val_reg. Marked points: A to R.]
Figure 5.4: Test 1, Detailed Activities

[Waveform plot: test2a1 with all delays=1, time scale 0 to 198. Signals: I_bus, iiu_ack, seq_num, reg_rs, reg_rt, reg_rd, log_req, res_req, rf_req, alu_req, mem_req, dmem_rq, dq_req, R_seq, R_data, R_dest, alu_arb, dq_arb, clr_req, wrf_reg, wrf_req, val_seq, val_reg, val_mem. S and L mark stores and loads; marked points: A to F.]
Figure 5.5: Test 2 with All Delays=1

5.2. Delayed Write Buffers and Checker Arbitration

Figure 5.5 is the simulation output of test program #2 with all delays set to 1. The signal mem_req is the DMEM request from IIU to MEMDWB, and MEMDWB then communicates with DMEM through dmem_rq as needed. The signal dq_req is driven by either DMEM or MEMDWB, depending on where the data is actually retrieved. These three waveforms are labeled S and L, for store and load, for ease of reference. The ORI instruction is another example of reading data from REGDWB, as indicated by pulse A occurring before pulse E.

[Waveform plot: test2a2 with DLY_DMEM_WR=15, time scale 0 to 197. Signals: I_bus, iiu_ack, seq_num, reg_rs, reg_rt, reg_rd, log_req, res_req, rf_req, alu_req, mem_req, dmem_rq, dq_req, R_seq, R_data, R_dest, alu_arb, dq_arb, clr_req, wrf_reg, wrf_req, val_seq, val_reg, val_mem. S and L mark stores and loads; marked points: X, Y.]
Figure 5.6: Test 2 with Long DMEM Write Cycle

// ; test2.a -- MEMDWB, REGDWB, checker arbitration
//
// ; one at a time: DLY_DMEM_WR=15, DLY_CKM_CHK=10
// ; DLY_CKF_CHK=10
//
1AC000040 // 00 sw 40h(r0), r0
18C010040 // 04 lw r1, 40h(r0) ; read from MEMDWB
034220007 // 08 ori r2, r1, 7 ; read from REGDWB
08C030040 // 0C lw r3, 40h(r0) ; read from memory
144000064 // 10 trap 100

The SW instruction is sent to MEMDWB at point B, but the memory write is not performed until point D, because the store has to be validated at point F first. When the write is completed, that entry is deleted from MEMDWB, and the LW instruction at point C has to retrieve the data from DMEM. With a long memory write cycle, as shown in Figure 5.6, the load request may arrive before the write is finished. The data is then transferred directly from MEMDWB to DQ at point Y, saving a DMEM read cycle. For the second LW instruction, data is still read from DMEM at point X, as expected.

[Waveform plot: test2a3 with DLY_CKM_CHK=10, time scale 0 to 196. Signals: I_bus, iiu_ack, seq_num, reg_rs, reg_rt, reg_rd, log_req, res_req, rf_req, alu_req, mem_req, dmem_rq, dq_req, R_seq, R_data, R_dest, alu_arb, dq_arb, clr_req, wrf_reg, wrf_req, val_seq, val_reg, val_mem. S and L mark stores and loads.]
Figure 5.7: Test 2 with Slow Memory Checker

Because a memory store has to be validated before the write cycle is initiated, a long delay through the memory checker CKM can have an effect similar to the previous case. In Figure 5.7, the read/search cycle for MEMDWB has already started when the write operation takes place. Even though reads have higher priority than writes, the external write cycle that is already in progress must be allowed to complete. The internal queue has already been halted by the read request, which prevents the data from being deleted upon write completion. Therefore, the data is retrieved from the buffer and sent to DQ.

[Waveform plot: test2b with DLY_CKF_CHK=3, time scale 0 to 200. Signals: seq_num, log_req, log_ack, K_seq, cki_arb, ckf_arb, ckm_arb, ckr_arb, chk_ack. Marked points: A to D.]
Figure 5.8: Test 2, Showing Arbitration of Checkers

Figure 5.8 shows the activity between the four checkers and the instruction log. The K_seq bus is shared by all checkers to send the sequence number being verified to LOG. The single chk_ack line from LOG is routed to all checkers, but it is meaningful only to the checker that has been granted access to the K_bus. Two arbitration requests overlap at points B and D. Pulse B actually becomes active one time step ahead of pulse D; therefore, CKI is given access while CKR must wait.

This figure also indicates that log_ack at point A has a longer delay than in other cycles. This is caused by the communication with CKF at point C, because the queue in LOG must be kept from advancing while the appropriate check bit is searched for and updated.

5.3. Arbitration for R_bus and Out-of-Order Completion

Just like the K_bus, the R_bus needs an arbiter to control the data flow from ALU and DQ. The following program is run with different DMEM delays, and the results are shown in Figures 5.9 and 5.10.

[Waveform plot: test3a1 with DLY_DMEM_RD=23, time scale 0 to 210. Signals: I_bus, iiu_ack, seq_num, reg_rd, log_req, res_req, rf_req, alu_req, mem_req, dmem_rq, dq_req, R_seq, R_data, R_dest, alu_arb, dq_arb, clr_req, wrf_reg, wrf_req, val_seq, val_reg. Marked points: D, E, F, W, X, Y, Z.]
Figure 5.9: Test 3 with Slow DMEM

// ; test3.a -- R_bus arbitration
//
// ; DLY_DMEM_RD=1 and 23
//
08C010000 // 00 lw r1, 0(r0)
08C010000 // 04 lw r1, 0(r0) ;completed after next instruction
024020002 // 08 addui r2, r0, 2
024030003 // 0C addui r3, r0, 3
144000064 // 10 trap 100

In Figure 5.9, arbitration requests D and F occur simultaneously. Access is granted to the ALU because the previous bus cycle was taken by DQ at point E. As a consequence, the result of the third instruction (ADDUI) is written to REGDWB before that of the second instruction (LW). Since REGDWB is a FIFO queue, the data are also committed to the register file in reverse order, as can be seen from the W and X pulses in the two figures. Even though instructions may complete out of order, they are always validated in the same sequence as issued, at points Y and Z.

[Waveform plot: test3a2 with all delays=1, time scale 0 to 187. Signals: I_bus, iiu_ack, seq_num, reg_rd, log_req, res_req, rf_req, alu_req, mem_req, dmem_rq, dq_req, R_seq, R_data, R_dest, alu_arb, dq_arb, clr_req, wrf_reg, wrf_req, val_seq, val_reg. Marked points: A, B, C, W, X, Y, Z.]
Figure 5.10: Test 3 with Fast DMEM

Figure 5.10 also shows two other results worth noting. The DQ-to-REGDWB cycle at C is delayed because of the REGFILE access at A: at that time, the queue in REGDWB is stopped so that uncommitted register data may be searched and retrieved. For the same reason, the write cycle at W takes longer to complete because of the rf_req signal at B.
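The effect of the R_bus arbitration on completion order can be sketched in Python. The request tuples and the tie-break rule (on simultaneous requests, the source that did not win the previous bus cycle goes first, as observed in Figure 5.9) are modeling assumptions, and the sketch assumes each transfer finishes before the next request arrives. The times and sequence numbers in the test are hypothetical.

```python
from itertools import groupby


def rbus_order(requests):
    """Return sequence numbers in the order they win the shared R_bus.

    requests: list of (time, source, seq), with source "ALU" or "DQ".
    """
    order, last = [], None
    for _, group in groupby(sorted(requests), key=lambda r: r[0]):
        group = list(group)
        # Simultaneous requests: the source that did not win last time goes
        # first (False sorts before True; the sort is stable otherwise).
        group.sort(key=lambda r: r[1] == last)
        for _, source, seq in group:
            order.append(seq)
            last = source
    return order
```

With DQ delivering sequence 1 first and then ALU (sequence 3) and DQ (sequence 2) requesting together, the ADDUI result wins, giving [1, 3, 2]. REGDWB, being a FIFO, would then commit in that order, while LOG still validates strictly in sequence order 1, 2, 3.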

[Waveform plot: test4a: All Faults, time scale 0 to 530. Signals: pc_load, pc_out, I_bus, iiu_req, iiu_ack, seq_num, reg_rd, log_req, res_req, rf_req, func, alu_req, mem_req, dmem_rq, dq_req, R_seq, R_data, R_dest, alu_arb, dq_arb, clr_req, wrf_reg, wrf_req, K_seq, K_chkid, K_err, val_seq, val_reg, val_mem, hlt_req.]
Figure 5.11: Test 4, All Faults

5.4. Fault Detection and Rollback

All of the simulation results presented so far are free of faults (at least from the processor's point of view). This section shows how the rollback process works when faults are detected. Figure 5.11 is an overview of running test program #4 with all faults shown. The five halt request (hlt_req) pulses at the bottom correspond to the five faults in the program. Each one is displayed and discussed in greater detail below.

[Waveform plot: test4b: CKI Fault, time scale 7 to 106. Signals: pc_load, pc_out, I_bus, iiu_req, iiu_ack, seq_num, reg_rd, log_req, res_req, rf_req, func, alu_req, R_seq, R_data, R_dest, alu_arb, clr_req, wrf_reg, wrf_req, K_seq, K_chkid, K_err, val_seq, val_reg, hlt_req. Marked points: A to F.]
Figure 5.12: Test 4, CKI Fault

// ; test4.a -- faults, rollback
//
03401000C // 00 ori r1, r0, here
100000000 // 04 *nop ; cki fault
0C8200000 // 08 jrf r1 ; ckf fault
1C4020001 // 0C here: adduif r2, r0, 1 ; ckr fault
1CC000040 // 10 swf 40h(r0), r0 ; ckm fault
08C03001C // 14 lw r3, badata(r0) ; ckr fault
144000064 // 18 trap 100
//
100000000 // 1C badata: *.word 0

The first fault is caused by the NOP instruction retrieved from IMEM, and it is detected at point D in Figure 5.12. The three K_bus components indicate that there is an error for sequence number 1, detected by checker 0 (CKI). The instruction log controller then initiates a rollback cycle by activating hlt_req at point F, and the program counter is updated at point A. Before the rollback starts, the result of the first instruction, ORI, has already been written to REGDWB at point B. Since ORI does not depend on the erroneous instruction, its data is kept in REGDWB, and after it is validated at point E, it is written to REGFILE at point C.

The second fault comes from the JRF instruction. Register R1, written by ORI, is correct, but JRF forces a bad parity bit when R1 is read, and the fault is detected by the register file checker CKF. The pc_load signal at point G in Figure 5.13 is the first JRF attempt. A new address is loaded at point H for rollback, and the final branch occurs at point I.

[Waveform plot: test4c: CKF Fault, time scale 78 to 168. Signals: pc_load, pc_out, I_bus, iiu_req, iiu_ack, seq_num, reg_rd, log_req, res_req, rf_req, K_seq, K_chkid, K_err, val_seq, hlt_req. Marked points: G, H, I.]
Figure 5.13: Test 4, CKF Fault

The next instruction, ADDUIF, tells the ALU to perform an addition with bad parity. The results can be seen on the R_data bus in Figure 5.14, before the rollback at point J and after the rollback at point K. Both data words are written to REGDWB, but the incorrect one is invalidated in the rollback process, and only one word is actually committed to the register file at point L. Similarly, the SWF instruction causes erroneous data to be written to MEMDWB at point M in Figure 5.15. The rollback cycle makes the correction, and the data is sent to DMEM at point N.

[Waveform plot: test4d: ALU CKR Fault, time scale 169 to 313. Signals: pc_load, pc_out, I_bus, iiu_req, iiu_ack, seq_num, reg_rd, log_req, res_req, rf_req, func, alu_req, R_seq, R_data, R_dest, alu_arb, clr_req, wrf_reg, wrf_req, K_seq, K_chkid, K_err, val_seq, val_reg, hlt_req. Marked points: J, K, L.]
Figure 5.14: Test 4, ALU CKR Fault

[Waveform plot: test4e: CKM Fault, time scale 272 to 393. Signals: pc_load, pc_out, I_bus, iiu_req, iiu_ack, seq_num, reg_rd, log_req, res_req, rf_req, mem_req, dmem_rq, K_seq, K_chkid, K_err, val_seq, val_mem, hlt_req. Marked points: M, N.]
Figure 5.15: Test 4, CKM Fault

Another CKR fault occurs when the LW instruction tries to load a bad word from DMEM. Since the error is in DMEM and not in MEMDWB, two DMEM read cycles occur in Figure 5.16, at points O and P. The rest of the events are similar to those of the CKR fault caused by ADDUIF, because both instructions modify the register file. Note that the pulse at N comes from the previous SWF instruction.

Figure 5.17 takes the first fault, at NOP, as an example to show the detailed interactions between IIU and LOG. The four halt and rollback signals are discussed in section 3.5.2 and also shown in Figure 3.7. At point Y, the IIU sends the sequence number on seq_num, the instruction address on A_bus_L, and the instruction on B_bus to the LOG. A_bus_L is just the lower bits of A_bus that are significant for address formation. Before the rol_req signal is released at point X, the rollback address is retrieved from LOG and sent to IIU on A_bus_L. The IIU then updates its internal program counter and loads it into PC at point Z.

[Waveform plot: test4f: DMEM CKR Fault, time scale 369 to 530. Signals: pc_load, pc_out, I_bus, iiu_req, iiu_ack, seq_num, reg_rd, log_req, res_req, rf_req, mem_req, dmem_rq, dq_req, R_seq, R_data, R_dest, dq_arb, clr_req, wrf_req, K_seq, K_chkid, K_err, val_seq, val_reg, hlt_req. Marked points: N, O, P.]
Figure 5.16: Test 4, DMEM CKR Fault

[Waveform plot: test4g: Rollback Sequence, time scale 8 to 70. Signals: val_seq, hlt_req, hlt_ack, rol_req, rol_ack, seq_num, A_bus_L, B_bus, log_req, cki_req, pc_load, pc_out. Marked points: X, Y, Z.]
Figure 5.17: Test 4, Rollback Sequence

5.5. Running a Real Program

A simple but useful program is shown below to exercise various parts of the processor while causing three different faults. The program should be self-explanatory, and the simulation waveforms are displayed in Figure 5.18. Even though the details are too small to be traced, some properties, such as the variations in register reservation delays (res_req) and the occurrences of rollback (hlt_req), can still be observed.

// ; test5.a -- find the largest number
//
13401003C // 00 *ori r1, r0, first
02402004C // 04 addui r2, r0, last
100001825 // 08 or r3, r0, r0 ;current largest number
//
18C240000 // 0C loop: lw r4, 0(r1) ; load from memory
10083282B // 10 sgt r5, r4, r3 ; r5 = 1 if r4 > r3
010A00004 // 14 beqz r5, smallr
000041825 // 18 or r3, r0, r4 ; found larger number
000223028 // 1C smallr: seq r6, r1, r2 ; check if done
114C00008 // 20 bnez r6, done
124210004 // 24 addui r1, r1, 4 ; increment pointer
00BFFFFE0 // 28 j loop
//
034070050 // 2C done: ori r7, r0, largst
1CCE30000 // 30 swf 0(r7), r3 ; store in memory
044000003 // 34 trap 3 ; print the number
144000064 // 38 trap 100 ; stop
//
00000000A // 3C first: .word 10
100000005 // 40 *.word 5
10000000D // 44 .word 13
100000002 // 48 .word 2
000000009 // 4C last: .word 9
//
000000000 // 50 largst: .word 0
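As a reference for the expected result, the loop of test5.a can be transliterated into Python. This is only an illustration of the program's intent, not part of the simulation environment; the comments map each step to the DLX instruction it stands for. With the five data words in the listing, a correct run (with every injected fault rolled back) should store 13 at largst.

```python
def find_largest(words):
    """Python transliteration of the test5.a loop.

    Like the DLX code, it starts from 0 and uses a signed greater-than
    (sgt), so a list containing only negative numbers yields 0.
    """
    largest = 0                # or r3, r0, r0
    for value in words:        # lw r4, 0(r1) ... addui r1, r1, 4 ; j loop
        if value > largest:    # sgt r5, r4, r3 ; beqz r5, smallr
            largest = value    # or r3, r0, r4
    return largest             # swf 0(r7), r3 ; trap 3
```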

[Waveform plot: test5a: Find the Largest Number, time scale 0 to 1543. Signals: pc_load, pc_out, I_bus, iiu_req, iiu_ack, seq_num, log_req, res_req, rf_req, func, alu_req, mem_req, dmem_rq, dq_req, R_seq, R_dest, alu_arb, dq_arb, clr_req, wrf_reg, wrf_req, cki_arb, ckf_arb, ckm_arb, ckr_arb, K_seq, K_chkid, K_err, val_seq, val_reg, val_mem, hlt_req.]
Figure 5.18: Test 5, Find the Largest Number

Chapter Six
Hardware Design and Gate-Level Simulation

The file gates.v in Appendix A is a collection of gate-level components from which more complex circuits are built in other Verilog modules. The main purpose of this chapter is to show how some structures used in the behavioral Verilog model may be built in hardware. Some circuit diagrams contain high-level logic gates for simplicity and clarity, but more efficient implementations may be found at the full transistor level.

6.1. C-Elements

The C-element is a very basic unit in a self-timed system, and an NMOS 2-input C-element appears in [Seit80, p. 255]. The number of inputs may be extended by adding transistors to the parallel and series structures, as shown in Figure 6.1. However, the number of transistors in series should be limited because of the body effect.

[Circuit diagram: inputs in1, in2, in3; output out; transistors M1 and M2 marked.]
Figure 6.1: 3-Input C-Element

When a C-element with a large number of inputs is required, it can easily be built from smaller C-elements, because its function is associative. The 14-input C-element used in bigc.v is shown in Figure 6.2. Sometimes a C-element needs to be reset to clear its state, which can be accomplished by forcing the inputs to zero. Figure 6.3 represents the muller2c module in gates.v. Other C-element implementations can be found in [Berk91, Jaco90, Suth89].
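The behavior of a C-element, and of a wide one built from a tree of smaller elements, can be modeled in a few lines of Python. The class names and the grouping of four inputs per leaf are assumptions (Figure 6.2 need not partition its inputs this way), and clear is modeled simply as forcing the output low. The test drives a 4-phase style sequence in which every input rises once and then falls once; under that discipline the tree and a single 14-input element agree. If an input rises and falls again before the others have all switched, the two can differ, which the last test case shows.

```python
class CElement:
    """Behavioral Muller C-element with an optional clear input."""

    def __init__(self):
        self.out = 0

    def update(self, inputs, clear=False):
        if clear:
            self.out = 0          # clear modeled as forcing the output low
        elif all(inputs):
            self.out = 1          # all inputs high: output rises
        elif not any(inputs):
            self.out = 0          # all inputs low: output falls
        # mixed inputs: the previous output is held
        return self.out


class CTree:
    """Wide C-element built as leaves of `group` inputs feeding one root."""

    def __init__(self, n_inputs, group):
        self.groups = [range(i, min(i + group, n_inputs))
                       for i in range(0, n_inputs, group)]
        self.leaves = [CElement() for _ in self.groups]
        self.root = CElement()

    def update(self, inputs, clear=False):
        mids = [leaf.update([inputs[i] for i in g], clear)
                for leaf, g in zip(self.leaves, self.groups)]
        return self.root.update(mids, clear)
```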

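[Diagram: inputs 1 to 14 merged by a tree of smaller C-elements into a single output.]
Figure 6.2: 14-Input C-Element

[Diagram: C-element symbol and gate-level implementation with inputs in1, in2 and clear; output out.]
Figure 6.3: 2-Input C-Element with Clear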

All transistor and gate delays in the Verilog model are set to 1 unless timing issues are involved. The following code segment comes from the muller3 module (gates.v, lines 29 and 30):

nmos #2 m1 (out, gnd, b);
nmos #1 m2 (b, a, out);

Transistors M1 and M2 are marked in Figure 6.1. The cross-coupled structure needs to be biased so that oscillation does not occur in simulation because of exactly matched delays when two or more inputs switch at the same time in opposite directions. This kind of switching does not occur under the 4-phase handshake protocol, except during a local reset, when the transaction is canceled anyway. In a real circuit, the C-element should not oscillate.

6.2. Arbiter

Another circuit, an interlock element, is also presented in [Seit80, p. 261]. It is shown in Figure 6.4 together with the logic for the rollback signals; together they make up the arbiter for R_bus. This arbiter is non-prioritized, and multiple modules can be connected in a binary-tree fashion to support more than two arbitration requests.

[Schematic: arbitration inputs req0 and req1 with grants ack0 and ack1 and internal nodes v0 and v1; rollback signals halt_req, halt_ack, roll_req and roll_ack.]
Figure 6.4: Arbiter for R_bus

Since the arbiter is not directly involved in transferring data, it is always reset when a rollback occurs. As long as the actual data transmitter and receiver handle the rollback sequence properly, the arbiter outputs do not have to be stopped before halt_ack is sent. This simplification allows the arbiter to run at full speed in normal operation. The 3-input AND gate delays the generation of roll_ack until all arbitration grants (acks) are released.
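A behavioral sketch of the two-way arbiter in Python follows. It keeps the two grants mutually exclusive, holds a grant until its request is withdrawn, breaks simultaneous requests in favor of req0 (mirroring the bias seen in the gate-level simulation; a real arbiter may resolve a tie either way), and clears all grants on reset, as happens during a rollback. The roll_ack and halt_ack functions are assumptions: roll_ack assumes the 3-input AND gate combines roll_req with the two inverted grants, and halt_ack assumes the halt request is acknowledged immediately because the arbiter outputs need not be stopped first. None of the names come from the Verilog model.

```python
class Arbiter2:
    """Behavioral two-way arbiter (hypothetical model, no timing).

    At most one grant is active. A grant is held until its request is
    withdrawn. If both requests are pending while the resource is free,
    req0 wins, reproducing the bias of the simulated cross-coupled gates."""

    def __init__(self):
        self.ack = [0, 0]

    def update(self, req0, req1, reset=0):
        if reset:                       # the arbiter is always reset on rollback
            self.ack = [0, 0]
            return tuple(self.ack)
        req = (req0, req1)
        for i in (0, 1):                # release grants whose request dropped
            if self.ack[i] and not req[i]:
                self.ack[i] = 0
        if not any(self.ack):           # grant only when nobody holds R_bus
            if req[0]:
                self.ack[0] = 1
            elif req[1]:
                self.ack[1] = 1
        return tuple(self.ack)


def roll_ack(roll_req, ack0, ack1):
    """Assumed 3-input AND: acknowledge rollback only after both grants are released."""
    return int(bool(roll_req) and not ack0 and not ack1)


def halt_ack(halt_req):
    """Assumed: the arbiter outputs need not be stopped, so halt is acknowledged at once."""
    return int(bool(halt_req))
```

For example, two simultaneous requests yield (1, 0); once req0 is withdrawn the pending req1 is granted, and roll_ack stays low until both grants have been released.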

Figure 6.5 shows the result of running test program #3 (section 5.3) with the gate-level arbiter. The signal alu_arb is connected to req0, and dq_arb is connected to req1. When both requests become active at the same time, the ALU wins because the simulated circuit is biased toward req0 by the cross-coupled AND gates. In an actual implementation, slight mismatches between the cross-coupled pairs would help combat the problem of metastability.

Figure 6.5: Gate-Level Arbiter (simulation arbr1, test3, DLY_DMEM_RD=23, time 103 to 220; signals R_seq, alu_arb, dq_arb, req0, v0, ack0, req1, v1, ack1)

6.3. Register Completion Detector

Before the completion circuit is discussed, some latches must be introduced. Figure 6.6 is a simple latch made from inverters and pass gates. In an actual circuit implementation, full CMOS pass gates may be preferable because they reduce power consumption, but they make no difference in logic simulation. Figure 6.7 adds the capability to set or reset the latch by connecting one of the S/R transistors (but not both). Again, the NMOS pull-up is used only for simplicity.

Figure 6.6: D-Latch (data input D, gate G, output Q)

After data is latched in a register, a signal has to be generated to notify other circuits of its completion. A fully self-timed circuit requires a completion signal generator (XNOR) for each bit of the register and a large C-element to merge them. The overhead is very high, especially if the processor has a high proportion of register elements.

Figure 6.7: D-Latch with Set/Reset (inputs D, G, and S/R; output Q)

Figure 6.8: Register Completion Detector (flip-flop x1, delay latches x2 and x3, and the REGISTER; signals latch, reset, osc, q1, q2, same1, same2, done, regclk, complete)

The completion scheme used in AMPIRE is shown in Figure 6.8. Component x1 is a master-slave flip-flop made from two resettable D-latches. With an inverter connected from its output back to its input, the output flips every time a latch request is received. Components x2 and x3 model the delay of the actual register; they are most likely slower than the real register on the same die (compare Figures 6.6 and 6.7), so the completion signal errs on the safe side. Together with the flip-flop "oscillator", the outputs q1 and q2 switch every latch cycle, modeling both rise and fall delays.

For the register, the gate is opened when the latch signal becomes active. After the signal done becomes high, the gate is closed, protecting the register from further changes at its input. Therefore, the latches in the register behave like edge-triggered flip-flops without the additional hardware complexity. The register can be arbitrarily long, but the completion circuit must be placed properly to take wire delays into account. Note that the reset signal is not necessarily the global processor reset. Simulation waveforms will be shown with the gate-level instruction queue.
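The completion scheme can be illustrated with a small timing model, written in Python as a hypothetical sketch. Here x1 is modeled as a toggle whose output osc flips on every latch request. The replica latches are assumed to hold osc and its complement, so on every cycle one replica rises and the other falls, and done waits for the slower of the two edges. The delays are arbitrary placeholder values, and regclk simply restates the gating rule from the text (open while latch is active, closed once done is high).

```python
class CompletionDetector:
    """Hypothetical timing model of the register completion detector."""

    def __init__(self, rise_delay, fall_delay):
        self.rise_delay = rise_delay
        self.fall_delay = fall_delay
        self.osc = 0   # x1 flip-flop output (the "oscillator")
        self.q1 = 0    # replica latch assumed to hold osc
        self.q2 = 1    # replica latch assumed to hold the complement of osc

    def _delay(self, old, new):
        return self.rise_delay if new > old else self.fall_delay

    def latch(self, t_start):
        """Handle one latch request at t_start and return when done rises."""
        self.osc ^= 1                          # feedback inverter flips x1
        new_q1, new_q2 = self.osc, 1 - self.osc
        t_same1 = t_start + self._delay(self.q1, new_q1)
        t_same2 = t_start + self._delay(self.q2, new_q2)
        self.q1, self.q2 = new_q1, new_q2
        # done needs both "same" checks, so it waits for the slower edge.
        return max(t_same1, t_same2)


def regclk(latch, done):
    """Register gate: open while latch is active and done has not risen."""
    return int(bool(latch) and not done)
```

With a rise delay of 2 and a fall delay of 3, done arrives 3 time units after every latch request. With only one replica, done would alternate between the rise and the fall delay from cycle to cycle; tracking both edges makes every cycle wait for the worse of the two.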

6.4. Instruction Queue

The buffers in AMPIRE can be separated into two groups: those that are always cleared during a rollback or other conditions, and those with selective invalidation. The IQ buffer falls into the former group and is presented here; the latter type is discussed in the next section. The actual storage elements are simple latches that can be used with the register completion detector of the previous section. What remains is the additional circuitry needed to communicate with other modules and sub-modules.

The IQ buffer control is shown in Figure 6.9. The two C-elements make up a 4-phase full-handshake circuit, and they represent the input and output cycles in the behavioral code. See [Meng89] for a detailed description of this handshake circuit. Cancel_r is asserted when a rollback or branch occurs. The buffer for cancel_a has to provide enough delay to reset the control circuit, including the register completion unit. A better method of generating that acknowledgment would be preferable, because matching the delays of a large circuit may be difficult and is not always reliable.

Figure 6.9: Control for IQ Buffer (two C-elements; signals in_r, in_a, out_r, out_a, cancel_r, cancel_a, latch, complete, clear, reset)

Figure 6.10: Gate-Level IQ (simulation iq1, test1, time 73 to 171; signals pc_load, pc_out, in_d, out_d, in_r, in_a, out_r, out_a, can_r, can_a, latch, osc, q1, q2, same1, same2, done, regclk, complete; markers A to G)

Figure 6.10 was simulated with full gate-level buffers, running test program #1 (section 5.1). Pulses (A,B) and (C,D) form the 4-phase handshake pairs for input and output, respectively. The output request C can start before the input acknowledgment B is finished because the data is already latched and ready for output. Looking at the signals inside the register completion detector, the done signal at point F becomes high after the input data has propagated to the output of the latch, as checked by the two same signals (same1 and same2). The register clock regclk is then turned off at point G. When a branch is taken, the queue is cleared in response to the cancel request can_r, which is connected directly to pc_load. As a result, the signal out_r is dropped at point E without receiving a corresponding acknowledgment.
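The 4-phase protocol seen in Figure 6.10 can be checked mechanically. The following Python sketch is a hypothetical trace checker, not part of the AMPIRE model; it verifies that a req/ack pair follows req rising, ack rising, req falling, ack falling. The event order in iq_trace is an assumed ordering consistent with the description above, in which the output request is raised before the input handshake completes. Dropping out_r without an acknowledgment, as happens at point E, is rejected by the checker, which is why the IQ relies on the separate cancel_r/cancel_a handshake to reset its control circuit.

```python
def check_four_phase(events, req, ack):
    """Count complete 4-phase handshakes on one req/ack pair.

    events is a time-ordered list of (signal_name, new_level) tuples; events
    on other signals are ignored. Raises ValueError on the first transition
    that breaks the order req+, ack+, req-, ack-."""
    expected = [(req, 1), (ack, 1), (req, 0), (ack, 0)]
    phase = 0
    completed = 0
    for name, level in events:
        if name not in (req, ack):
            continue
        if (name, level) != expected[phase]:
            raise ValueError(f"4-phase violation: {name}={level} in phase {phase}")
        phase = (phase + 1) % 4
        if phase == 0:
            completed += 1
    return completed


# Assumed ordering for one transfer through the IQ buffer: the output request
# is raised as soon as the data is latched, before the input handshake ends.
iq_trace = [
    ("in_r", 1), ("in_a", 1), ("out_r", 1), ("in_r", 0),
    ("in_a", 0), ("out_a", 1), ("out_r", 0), ("out_a", 0),
]
```

Both channels in iq_trace complete exactly one handshake, even though their transitions are interleaved.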

6.5. Data Queue

Unlike the IQ buffer, the buffers in DQ, DWBs, checkers, etc. have to be

selectively invalidated in the event of a rollback. Since invalidation is based on

sequence number comparison, the comparator hardware will be discussed next.

6.5.1. Sequence Number Comparator

The rollback code was introduced in section 4.1.6, and these lines describe what

needs to be done to determine if data should be invalidated:

diff = sequence_num - sequence_error;
if (~diff[high_bit])
    valid = 0;

The subtractor can be simplified because only the highest bit of the difference is used. Four bits are allocated for the sequence number in AMPIRE, so we need three bits of borrow (as opposed to carry) circuits and one bit of difference circuit, as shown in Figure 6.11. BRWEND is a borrow generator without borrow-in, used for the least significant bit. The logic equations for these blocks are (for $A - B$, with borrow-in $X$):

BRWEND: $Z = \bar{A}B$

BRW: $Z = \bar{A}B + \bar{A}X + BX = \bar{A}B + X(\bar{A} + B)$

SUB: $D = A \oplus B \oplus X$

Figure 6.11: Sequence Number Comparator (bit 0: BRWEND; bits 1 and 2: BRW; bit 3: SUB; inputs seq and error; outputs invalid and ack; request req)

Figure 6.12: Borrow Circuits, Full and Half
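The borrow chain of Figure 6.11 can be checked against plain modular arithmetic. This Python sketch implements the BRWEND, BRW, and SUB equations above bit by bit for SEQ_WIDTH = 4 and reports invalid when the top difference bit is 0, that is, when the behavioral test ~diff[high_bit] would clear valid. The function names are illustrative only.

```python
SEQ_WIDTH = 4  # same value as the SEQ_WIDTH parameter


def _bit(value, i):
    """Bit i of an unsigned integer."""
    return (value >> i) & 1


def brwend(a, b):
    """Borrow out of the least significant bit (no borrow-in): Z = ~A & B."""
    return (1 - a) & b


def brw(a, b, x):
    """Full borrow with borrow-in X: Z = ~A&B | X&(~A | B)."""
    na = 1 - a
    return (na & b) | (x & (na | b))


def sub(a, b, x):
    """Top difference bit: D = A xor B xor X."""
    return a ^ b ^ x


def invalid(seq_num, seq_error, width=SEQ_WIDTH):
    """Return True when the top bit of (seq_num - seq_error) is 0,
    i.e. when the behavioral code would execute valid = 0."""
    x = brwend(_bit(seq_num, 0), _bit(seq_error, 0))
    for i in range(1, width - 1):
        x = brw(_bit(seq_num, i), _bit(seq_error, i), x)
    d = sub(_bit(seq_num, width - 1), _bit(seq_error, width - 1), x)
    return d == 0


if __name__ == "__main__":
    # Exhaustive check against modular arithmetic for 4-bit sequence numbers.
    for n in range(16):
        for e in range(16):
            assert invalid(n, e) == ((((n - e) % 16) >> 3) == 0)
    print(invalid(0, 1), invalid(0, 3), invalid(4, 3))  # False False True
```

The three printed cases are the comparisons that appear in the DQ simulations below: $0 - 1$ and $0 - 3$ leave the data valid, while $4 - 3$ invalidates it.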

The self-timed components appear in Figures 6.12 through 6.14; they are designed using a technique called DCVSL, differential cascode voltage switch logic [Jaco90, Meng89]. The PMOS transistors precharge the complementary nodes $Y$ and $\bar{Y}$ while req is low, and the NMOS transistor gated by req prevents premature discharge. When the request signal is activated, only one of the precharged nodes is pulled low, because the two NMOS trees are complementary. Output inverters are added, as in Figure 6.13, so that both outputs are low while the circuit is inactive or not ready. The differential outputs can then be fed directly into another DCVSL circuit. At the final stage of the chain (SUB), the two outputs are ORed together to generate the completion/acknowledgment signal. Because the circuit is self-timed, its completion time varies with the data being compared.

Figure 6.13: XOR Gate

Figure 6.14: Difference (SUB) Circuit (built from DCVSL XOR stages; inputs A, B, X, and req; differential outputs D; completion output ack)
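The data-dependent completion time can also be sketched. The Python model below assumes that every stage receives req at time 0 and takes one unit of delay to evaluate. It uses a property of the BRW equation: when A and B differ, one of the two complementary NMOS trees can discharge without the borrow-in, so the borrow-out equals B immediately; when A = B, the borrow-out equals the borrow-in and must wait for it. The SUB stage always needs its borrow-in. The unit-delay assumption and the helper names are hypothetical; dual_rail shows the output encoding after the inverters, where (0, 0) means not ready.

```python
def dual_rail(value, ready):
    """DCVSL output pair after the inverters: (0, 0) while precharged or not
    yet evaluated, (1, 0) for logic 1, (0, 1) for logic 0."""
    if not ready:
        return (0, 0)
    return (1, 0) if value else (0, 1)


def comparator_timing(seq_num, seq_error, width=4, stage_delay=1):
    """Return (invalid, ready_times) for the borrow chain under a unit-delay
    assumption; ready_times[i] is when stage i's dual-rail output is valid
    (the last entry is the SUB stage, i.e. the acknowledgment)."""
    a = [(seq_num >> i) & 1 for i in range(width)]
    b = [(seq_error >> i) & 1 for i in range(width)]
    borrow = (1 - a[0]) & b[0]          # BRWEND has no borrow-in
    times = [stage_delay]
    for i in range(1, width - 1):       # BRW stages
        if a[i] != b[i]:
            # One complementary tree discharges without the borrow-in.
            borrow = b[i]
            times.append(stage_delay)
        else:
            # With A == B the borrow-out equals the borrow-in, so wait for it.
            times.append(times[-1] + stage_delay)
    top = a[-1] ^ b[-1] ^ borrow        # SUB always needs its borrow-in
    times.append(times[-1] + stage_delay)
    return top == 0, times
```

Under these assumptions, $0 - 1$ gives ready times [1, 2, 3, 4], so every stage waits on its predecessor, while $0 - 3$ gives [1, 1, 2, 3] and $4 - 3$ gives [1, 1, 1, 2], with bits 0 and 1 finishing together.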

6.5.2. DQ Control

The control circuit for the DQ buffer is shown in Figure 6.15. As with the IQ buffer, the register completion detector is interfaced through latch and complete. A sequence number comparator cycle is started with roll_req, and the acknowledgment cmp_done is raised when the result invalid is available.

Figure 6.15: Control for DQ Buffer (signals in_r, in_a, out_r, out_a, latch, complete, halt_req, halt_ack, roll_req, roll_ack, cmp_done, invalid, cout1, cout2, clear1, clear2, reset; AND gates g8 and g9)

The square block marked with an equal sign is a standard resettable D-latch, with its gate controlled by halt_req (connection not shown) and its reset connected to clear1. In normal operation it is transparent (output = input). When a rollback cycle starts and halt_req is asserted, the latch closes and cuts off its inputs, accomplishing wait (~halt_req) in the behavioral code. However, complete cannot be interrupted: if data has already been latched, the sender must be notified, or the data may be duplicated when normal operation resumes (see section 4.1.6). The delay buffer in the halt_ack circuit is provided for that purpose.

The cout2 signal is equivalent to the valid variable in the behavioral code. It is cleared during the rollback process only if the corresponding register is invalidated, which activates clear2. The AND gates g8 and g9 partially determine when the invalidation process is finished. If invalidation is not necessary, roll_ack can proceed as soon as the comparison is done (g9). Otherwise, roll_ack cannot be sent until the cout2 signal is reset (g8).
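A sketch of the rollback decision for one DQ buffer, in Python, is shown below. It assumes that g9 combines cmp_done with the inverted invalid result and that g8 combines cmp_done, invalid, and the inverted cout2. The text only says that g8 and g9 partially determine completion, so these gate inputs, the function names, and the omission of the halt_ack path are all assumptions. The comparison uses the same top-bit test as the behavioral code.

```python
def g9(cmp_done, invalid):
    """Assumed AND gate: comparison finished and no invalidation needed."""
    return int(bool(cmp_done) and not invalid)


def g8(cmp_done, invalid, cout2):
    """Assumed AND gate: invalidation needed and cout2 already reset."""
    return int(bool(cmp_done) and bool(invalid) and not cout2)


def rollback_done(cmp_done, invalid, cout2):
    """This buffer's contribution to roll_ack (other conditions not modeled)."""
    return g8(cmp_done, invalid, cout2) | g9(cmp_done, invalid)


def dq_rollback(cout2, seq, seq_error, width=4):
    """Walk one DQ buffer through a rollback.

    cout2 plays the role of the behavioral 'valid' bit. Returns the final
    cout2 and a list of (step, rollback_done) pairs."""
    steps = [("comparing", rollback_done(0, 0, cout2))]
    diff = (seq - seq_error) % (1 << width)
    invalid = int((diff >> (width - 1)) == 0)   # ~diff[high_bit]
    steps.append(("cmp_done", rollback_done(1, invalid, cout2)))
    if invalid and cout2:
        cout2 = 0                               # clear2 resets cout2
        steps.append(("clear2", rollback_done(1, invalid, cout2)))
    return cout2, steps
```

The two cases mirror the test programs that follow: with sequence number 0 and error 1 (test 6), cout2 survives the rollback and the buffer may acknowledge as soon as the comparison is done; with 4 and 3 (test 7), clear2 must reset cout2 before the buffer can acknowledge.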

6.5.3. DQ Simulation

Two test programs are run to show the internal signals of the DQ buffer and the sequence number comparator. Both programs read data from DMEM, which is sent to the REGDWB through DQ. The first program causes a CKI fault, but the fault does not invalidate the LW.

// ; test6.a -- gate level DQ, no invalidation
//
// ; DLY_DMEM_RD=50, DLY_CKI_CHK=48 and 52
//

08C010000 // 00 lw r1, 0(r0)
100000000 // 04 *nop
144000001 // 08 trap 1
144000064 // 0C trap 100
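The test programs are given as 33-bit hexadecimal words (the buses are DATA_WIDTH+1 bits wide). This Python sketch splits a word into its fields, taking the field widths from the parameter file (OP_WIDTH = 6, REG_WIDTH = 5, IMM_WIDTH = 16) and assuming the DLX I-type field order op, rs, rt, immediate; decoding lw r1, 0(r0) this way gives base r0 and destination r1, consistent with the listing. Bit 32 is extracted separately because its role is not described in this section, and the opcode table covers only the few OP_ values used in tests 6 and 7.

```python
OP_WIDTH, REG_WIDTH, EXTRA_WIDTH = 6, 5, 11        # from the parameter file
IMM_WIDTH = REG_WIDTH + EXTRA_WIDTH                # 16
DATA_WIDTH = OP_WIDTH + 2 * REG_WIDTH + IMM_WIDTH  # 32

# Only the opcodes used in tests 6 and 7 (values from the OP_ parameters).
OPCODES = {0: "special", 17: "trap", 35: "lw"}


def field(word, lsb, width):
    """Extract an unsigned bit field."""
    return (word >> lsb) & ((1 << width) - 1)


def decode(word):
    """Split a 33-bit test-program word into (bit32, opcode, rs, rt, imm).

    The field order op | rs | rt | immediate is an assumption (DLX I-type)."""
    bit32 = field(word, DATA_WIDTH, 1)
    imm = field(word, 0, IMM_WIDTH)
    rt = field(word, IMM_WIDTH, REG_WIDTH)
    rs = field(word, IMM_WIDTH + REG_WIDTH, REG_WIDTH)
    op = field(word, IMM_WIDTH + 2 * REG_WIDTH, OP_WIDTH)
    return bit32, OPCODES.get(op, op), rs, rt, imm


if __name__ == "__main__":
    for word in (0x08C010000, 0x100000000, 0x144000001, 0x144000064):
        print(f"{word:09X}", decode(word))
```

For example, 144000064 decodes to opcode 17 (OP_TRAP) with immediate 100, matching the "trap 100" comment in the listing.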

In Figure 6.16, the output request out_r is set at point A, but the halt request is activated at the same time. Out_r is cleared by clear1 at point C, but since cout2 retains its value through the rollback process (the data is not invalidated), another handshake cycle is started at point B.

The signals in the lower half of the figure belong to the sequence number comparator, which computes $0 - 1$. The pairs (_a, na) through (d, inv) are the differential outputs of the four sub-blocks in Figure 6.11. The waveforms are staggered because each stage depends on the result of the preceding stage.

Figure 6.16: Test 6, Gate-Level DQ with DLY_CKI_CHK=48 (simulation dq1, time 127 to 269; signals in_r, cout1, latch, in_a, a, cout2, out_r, out_a, h_req, h_ack, r_req, clear1, clear2, r_ack, seq, error, req, _a, na, b, nb, c, nc, d, inv, ack; markers A, B, C)

Delaying the rollback cycle slightly (DLY_CKI_CHK=52 instead of 48) produces Figure 6.17. By the time h_req becomes high at point G, the receiver has already started latching its register. Latching completes at point F, and the acknowledgment clears cout2 at point D. The signal out_r is held constant by a closed latch, so it is not dropped until point E, after clear1 is raised.

The next program causes the LW to be invalidated because of an error in the preceding instruction.

Figure 6.17: Test 6, Gate-Level DQ with DLY_CKI_CHK=52 (simulation dq2, time 127 to 224; same signals as Figure 6.16; markers D, E, F, G)

// ; test7.a -- gate level DQ, invalidation
//
// ; DLY_CKI_CHK=15, DLY_DMEM_RD=8 and 1
//

000000000 // 00 nop
000000000 // 04 nop
000000000 // 08 nop
100000000 // 0C *nop
08C010000 // 10 lw r1, 0(r0)
144000001 // 14 trap 1
144000064 // 18 trap 100

In Figure 6.18, an input request in_r is received and causes cout1 to become high. However, that signal is not propagated to latch because the path has already been closed by h_req. Since this buffer never receives the data, the data must be invalidated elsewhere.

Figure 6.18: Test 7, Gate-Level DQ with DLY_DMEM_RD=8 (simulation dq3, time 133 to 190; same signals as Figure 6.16)

In Figure 6.19, the sequence number comparison drives the inv signal high at point I. Then clear2 invalidates the data by resetting cout2 at point H. Notice that there is no activity on the out_r line during this period.

The sequence number comparisons performed in Figures 6.18 and 6.19 are $0 - 3$ and $4 - 3$ ($0000 - 0011$ and $0100 - 0011$ in binary). In both cases, bit 1 has $A = 0$ and $B = 1$, so its borrow is generated without waiting for the borrow from bit 0. Therefore, _a and b become high at the same time, reducing the borrow propagation time.

Figure 6.19: Test 7, Gate-Level DQ with DLY_DMEM_RD=1 (simulation dq4, time 127 to 196; same signals as Figure 6.16; markers H and I)

6.6. Fault Simulation with Gate-Level Modules

Four modules in AMPIRE have gate-level models: BIGC, ARBR, IQ, and DQ. Each can be switched between its behavioral and gate-level model through the `define declarations in the parameter file (BEHAV_BIGC, BEHAV_ARBR, BEHAV_IQ, and BEHAV_DQ); the comments there also call for changing RESET_TIME from 1 to 30 for gate-level runs. Figure 6.20 shows the result of running test program #4 (section 5.4) with all gate-level models enabled. The waveforms are very similar to those in Figure 5.11, which come from the behavioral model. Figure 6.20 does have an extra REGFILE write request pulse at point A. The data is not actually written at that time because the write is interrupted by the rollback cycle at point C. After the processor is restarted, the transaction is repeated at point B.

Figure 6.20: Fault Simulation with Gate-Level Modules (simulation allgates, test4a, all faults, time 0 to 1405; signals pc_load, pc_out, I_bus, iiu_req, iiu_ack, seq_num, reg_rd, log_req, res_req, rf_req, func, alu_req, mem_req, dmem_rq, dq_req, R_seq, R_data, R_dest, alu_arb, dq_arb, clr_req, wrf_reg, wrf_req, K_seq, K_chkid, K_err, val_seq, val_reg, val_mem, hlt_req; markers A, B, C)

Chapter Seven

Conclusion

This thesis has demonstrated a fault-tolerance method for an asynchronous processor. The register reservation mechanism guarantees mutual exclusion on the registers and enables concurrency for independent instructions. The instruction log and delayed write buffers provide temporary storage for unvalidated data, so permanent state elements are not affected until verification is finished. Because instructions may complete out of order in an asynchronous environment, sequence numbers are used to track dependencies at the instruction level, so that instructions that need to be rolled back can be properly identified and undone.

The performance of a synchronous system is penalized because the clock period is set by the longest path among all pipeline stages, so time is wasted in the faster stages. An asynchronous system takes only as much time as each task requires, and therefore achieves average-case rather than worst-case processing speed.

Asynchronous design also allows the system to be very modular because there is no global constraint such as the clock. Blocks of different speeds may be mixed and matched without altering the functionality of a system, as long as the handshake protocol is not violated. However, some resources may be significantly underutilized if the system is not properly balanced, and the handshaking overhead must be considered.

One of the difficulties with a fully asynchronous design is generating the handshake signals in the correct sequence. A state machine in the traditional synchronous sense is not available because there is no clock. Certain sequencing can be forced by introducing delay elements, as in some circuits in Chapter 6, but their drawbacks are limited accuracy and reliability under varying conditions (electrical, thermal, and IC process variations). A delay that is too short makes the circuit non-functional, but performance suffers if the delay is longer than necessary. CAD tools for handshake circuit synthesis and analysis are needed to assist the design of complex asynchronous systems.

As discussed in Chapter 2, delay-insensitive circuits guarantee correct asynchronous operation because no assumption is made about wire delays. However, the increased circuit complexity and chip area can be significant, and the larger physical size means increased propagation delays. Since gate and wire delays can be controlled locally without much difficulty, self-timed but delay-dependent circuits can be applied in small blocks. A proper trade-off between the two design methods should lead to a better implementation than either one alone.

One of the goals at the beginning of this research was to design an equivalent of micro rollback in an asynchronous environment. Micro rollback works at a finer granularity than instruction retry, and as a result, the number of repeated operations is reduced. For example, it is not necessary to fetch an instruction again if an error in that instruction is detected in the execution stage. In an asynchronous processor, however, the units cannot simply go back a few steps, because each one operates at its own speed. Dependency tracking would probably have to be based on individual operations rather than instruction sequence numbers, and if more than one dependency has to be kept [Stro85], it may become very difficult to manage. Furthermore, each functional unit (such as the ALU) may need to keep a history log so that previous data can be played back during rollback.

Some areas of this project can be improved and expanded. In AMPIRE, the instruction and data memories are completely separate units so that they are easier to control. In a real system, though, a single memory device usually holds both instructions and data. An arbiter would have to be added to coordinate instruction and data access, and the instruction fetch process would have to change because the IMEM portion could then be modified. Also, many faults cannot be detected and corrected with a single processor. If two asynchronous processors were connected as a master/slave pair like the UCLA Mirror Processor, additional handshake mechanisms would be needed for synchronization, adding another layer of overhead. Since AMPIRE has not been physically implemented, the impact of the fault-tolerance features cannot be measured quantitatively. However, based on the architecture of parallel data verification, the results should be similar to those of the Mirror Processor: high overhead in chip area but an insignificant reduction in performance.

Appendix A

Verilog Simulation Code

The Verilog modules for AMPIRE are listed in alphabetical order, except for parameter and ampire.v, which appear first.

Module        Description
parameter     Parameter declarations.
ampire.v      Top-level wiring and test setup.
alu.v         Arithmetic logic unit.
arbk.v        Arbiter for K_bus.
arbr.v        Arbiter for R_bus.
bigc.v        Big C-element for rollback synchronization.
ckf.v         Checker for REGDWB outputs to A_bus and B_bus.
cki.v         Checker for instructions executed by the IIU.
ckm.v         Checker for data written to MEMDWB.
ckr.v         Checker for data written to REGDWB.
dmem.v        Data memory.
dq.v          Data queue for memory-to-register transfers.
gates.v       Gate-level components.
iiu.v         Instruction issuing unit.
imem.v        Instruction memory.
iq.v          Instruction queue.
log.v         Instruction log.
memdwb.v      Delayed write buffer for data memory.
pc.v          Program counter.
regdwb.v      Delayed write buffer for register file.
regfile.v     Register file.
restable.v    Register reservation table.

Table A.1: Verilog Modules


parameter

1: parameter ADDR_WIDTH = 8; // width of address bus (bytes)
2: parameter ADDR_IGNORE = 2; // number of address bits to ignore
3: parameter ADDR_INC = 4; // amount of address increment (2^IGNORE)
4: parameter SEQ_WIDTH = 4; // bits for sequence number
5: parameter FUNC_WIDTH = 5; // ALU function code width
6: parameter CHECKERS = 4; // number of checkers
7: parameter CHKID_WIDTH = 2; // number of bits for checker ID
8: parameter DWBS = 2; // number of DWBs (for validation)
9:
10: parameter OP_WIDTH = 6; // bits for op code
11: parameter REG_WIDTH = 5; // bits for register number
12: parameter EXTRA_WIDTH = 11; // "extra" field, "func" in R-type
13:
14: parameter REG_SIZE = 32; // number of registers (2^REG_WIDTH)
15: parameter IMEM_SIZE = 64; // number of instruction memory WORDS
16: parameter DMEM_SIZE = 64; // number of data memory WORDS
17:
18: parameter IMM_WIDTH = REG_WIDTH + EXTRA_WIDTH; // immediate field
19: parameter OFFSET_WIDTH = 2 * REG_WIDTH + IMM_WIDTH; // offset field
20: parameter DATA_WIDTH = OP_WIDTH + OFFSET_WIDTH; // word size
21:
22: parameter RESET_TIME = 1; // 1 for behavioral, 30 for gate-level
23: `define BEHAV_IQ // comment out for gate-level
24: `define BEHAV_DQ // *** CHANGE RESET_TIME ***
25: `define BEHAV_ARBR
26: `define BEHAV_BIGC
27:
28: // `define WAVE_IQ // define to add signal waves
29: // `define WAVE_REGCMPL
30: // `define WAVE_DQ
31: // `define WAVE_SEQCMP
32: // `define WAVE_ARBR
33:
34: parameter DLY_PC_INC = 1; // delay settings
35: parameter DLY_IMEM_RD = 1;
36: parameter DLY_IIU_PCINC = 1;
37: parameter DLY_IIU_DECOD = 1;
38: parameter DLY_IIU_ADD = 1;
39: parameter DLY_RT_RES = 1;
40: parameter DLY_RT_CLR = 1;
41: parameter DLY_RF_RD = 1;
42: parameter DLY_RF_WR = 1;
43: parameter DLY_ALU_ADD = 1;
44: parameter DLY_ALU_SUB = 1;
45: parameter DLY_ALU_AND = 1;
46: parameter DLY_ALU_OR = 1;
47: parameter DLY_ALU_XOR = 1;
48: parameter DLY_ALU_PASS = 1;
49: parameter DLY_ALU_SHIFT = 1;
50: parameter DLY_ALU_COMP = 1;
51: parameter DLY_DMEM_RD = 1;
52: parameter DLY_DMEM_WR = 1;
53: parameter DLY_SEQ_COMP = 1; // sequence comparison delay for all modules
54:
55: parameter DLY_CKI_CHK = 1;
56: parameter DLY_CKF_CHK = 1;
57: parameter DLY_CKM_CHK = 1;
58: parameter DLY_CKR_CHK = 1;
59:
60: parameter ID_CKI = 0; // checker ID numbers
61: parameter ID_CKF = 1;
62: parameter ID_CKM = 2;
63: parameter ID_CKR = 3;
64:
65: parameter XXX = 33'bx; // to overcome default size of 32 bits
66: parameter ZZZ = 33'bz;
67:
68: parameter OP_SPECIAL = 0; // reg-reg special ops
69: parameter OP_J = 2;
70: parameter OP_JAL = 3;
71: parameter OP_BEQZ = 4;


72: parameter OP_BNEZ = 5;
73: parameter OP_ADDUI = 9;
74: parameter OP_SUBUI = 11;
75: parameter OP_ANDI = 12;
76: parameter OP_ORI = 13;
77: parameter OP_XORI = 14;
78: parameter OP_LHI = 15; // shift by IMM_WIDTH bits
79: parameter OP_TRAP = 17; // used as simulator special service
80: parameter OP_JR = 18;
81: parameter OP_JALR = 19;
82: parameter OP_SLLI = 20;
83: parameter OP_SRLI = 22;
84: parameter OP_SRAI = 23;
85: parameter OP_SEQI = 24;
86: parameter OP_SNEI = 25;
87: parameter OP_SLTI = 26;
88: parameter OP_SGTI = 27;
89: parameter OP_SLEI = 28;
90: parameter OP_SGEI = 29;
91: parameter OP_LW = 35;
92: parameter OP_SW = 43;
93: parameter OP_ADDUIF = 49; // add immediate with fault
94: parameter OP_JRF = 50; // jump register with fault
95: parameter OP_SWF = 51; // store word with fault
96:
97: parameter SOP_NOP = 0; // reg-reg special opcodes
98: parameter SOP_SLL = 4;
99: parameter SOP_SRL = 6;
100: parameter SOP_SRA = 7;
101: parameter SOP_ADDUF = 25; // add with fault
102: parameter SOP_ADDU = 33;
103: parameter SOP_SUBU = 35;
104: parameter SOP_AND = 36;
105: parameter SOP_OR = 37;
106: parameter SOP_XOR = 38;
107: parameter SOP_SEQ = 40;
108: parameter SOP_SNE = 41;
109: parameter SOP_SLT = 42;
110: parameter SOP_SGT = 43;
111: parameter SOP_SLE = 44;
112: parameter SOP_SGE = 45;
113:
114: parameter FUNC_ADDU = 0; // ALU function codes
115: parameter FUNC_SUBU = 1;
116: parameter FUNC_AND = 2;
117: parameter FUNC_OR = 3;
118: parameter FUNC_XOR = 4;
119: parameter FUNC_LHI = 5;
120: parameter FUNC_SLL = 6;
121: parameter FUNC_SRL = 7;
122: parameter FUNC_SRA = 8;
123: parameter FUNC_SLT = 9;
124: parameter FUNC_SGT = 10;
125: parameter FUNC_SLE = 11;
126: parameter FUNC_SGE = 12;
127: parameter FUNC_SEQ = 13;
128: parameter FUNC_SNE = 14;
129: parameter FUNC_PASS = 15;
130: parameter FUNC_ADDUF = 16; // add with fault
131:
132: parameter TRAP_BLANK = 90; // must be >= REG_SIZE
133: parameter TRAP_STOP = 100;


ampire.v

1: // AMPIRE Top-Level Wiring and Test Setup
2:
3: module ampire;
4:
5: `include "parameter"
6:
7: reg reset;
8:
9: // ====================================================================
10: // bus connections
11:
12: trireg [DATA_WIDTH:0] i_bus,
13:     a_bus,
14:     b_bus,
15:     d_bus;
16:
17: trireg [SEQ_WIDTH-1:0] r_bus_seq;
18: trireg [REG_WIDTH-1:0] r_bus_dest;
19: trireg [DATA_WIDTH:0] r_bus_data;
20: trireg r_bus_req;
21:
22: trireg [SEQ_WIDTH-1:0] k_bus_seq;
23: trireg [CHKID_WIDTH-1:0] k_bus_chkid;
24: trireg k_bus_error,
25:     k_bus_req;
26:
27: trireg tri_dq_req;
28:
29: pulldown (r_bus_req); // do not float handshaking signals
30: pulldown (k_bus_req);
31: pulldown (tri_dq_req);
32:
33: // ====================================================================
34: // component outputs
35:
36: wire [ADDR_WIDTH:0]
37:     pc_out;
38: wire pc_out_retry,
39:     pc_load_ack,
40:     pc_out_req;
41:
42: wire [DATA_WIDTH:0]
43:     imem_data;
44: wire imem_in_ack,
45:     imem_out_req,
46:     imem_cancel_ack;
47:
48: wire iq_in_ack,
49:     iq_out_req,
50:     iq_cancel_ack;
51:
52: wire [CHECKERS-1:0]
53:     iiu_chkbits;
54: wire [SEQ_WIDTH-1:0]
55:     iiu_seq_num;
56: wire [REG_WIDTH-1:0]
57:     iiu_reg_rd,
58:     iiu_reg_rs,
59:     iiu_reg_rt;
60: wire [FUNC_WIDTH-1:0]
61:     iiu_alu_func;
62: wire iiu_inst_en,
63:     iiu_sim_f,
64:     iiu_mem_rw,
65:     iiu_retry,
66:     stop;
67: wire iiu_inst_ack,
68:     iiu_log_req,
69:     iiu_cki_req,
70:     iiu_newpc_req,
71:     iiu_res_req,


72:     iiu_rfile_req,
73:     iiu_ckf_req,
74:     iiu_alu_req,
75:     iiu_mem_req;
76:
77: wire restable_res_ack,
78:     restable_clr_ack;
79:
80: wire alu_comp_ack,
81:     alu_arb_req;
82:
83: wire [ADDR_WIDTH:0]
84:     memdwb_out_addr;
85: wire [SEQ_WIDTH-1:0]
86:     memdwb_out_seq;
87: wire [REG_WIDTH-1:0]
88:     memdwb_out_reg;
89: wire memdwb_rw_mode,
90:     memdwb_out_retry,
91:     memdwb_in_ack,
92:     memdwb_chk_req,
93:     memdwb_val_ack,
94:     memdwb_out_req;
95:
96: wire dmem_in_ack;
97:
98: wire dq_in_ack,
99:     dq_arb_req;
100:
101: wire arbr_ack0,
102:     arbr_ack1;
103:
104: wire [SEQ_WIDTH-1:0]
105:     regdwb_out_seq;
106: wire [REG_WIDTH-1:0]
107:     regdwb_out_reg,
108:     regdwb_wrf_reg;
109: wire [DATA_WIDTH:0]
110:     regdwb_out_data,
111:     regdwb_wrf_data;
112: wire regdwb_in_ack,
113:     regdwb_chk_req,
114:     regdwb_clr_req,
115:     regdwb_val_ack,
116:     regdwb_rd_ack,
117:     regdwb_rdf_req,
118:     regdwb_wrf_req;
119:
120: wire [DATA_WIDTH:0]
121:     rfile_rd_data1,
122:     rfile_rd_data2;
123: wire rfile_wr_ack,
124:     rfile_rd_ack;
125:
126: wire cki_in_ack,
127:     cki_arb_req;
128:
129: wire ckf_in_ack,
130:     ckf_arb_req;
131:
132: wire ckm_in_ack,
133:     ckm_arb_req;
134:
135: wire ckr_in_ack,
136:     ckr_arb_req;
137:
138: wire arbk_ack0,
139:     arbk_ack1,
140:     arbk_ack2,
141:     arbk_ack3;
142:


143: wire [SEQ_WIDTH-1:0]
144:     val_seq;
145: wire log_ack,
146:     log_chk_ack,
147:     log_valmem_req,
148:     log_valreg_req,
149:     halt_req,
150:     roll_req;
151:
152: wire ch_out;
153:
154: wire cr_out;
155:
156: // ====================================================================
157: // component inputs
158:
159: wire [ADDR_WIDTH:0]
160:     pc_load = i_bus;
161: wire pc_in_retry = iiu_retry,
162:     pc_load_req = iiu_newpc_req,
163:     pc_out_ack = imem_in_ack;
164:
165: wire [ADDR_WIDTH:0]
166:     imem_addr = pc_out;
167: wire imem_retry = pc_out_retry,
168:     imem_in_req = pc_out_req,
169:     imem_out_ack = iq_in_ack,
170:     imem_cancel_req = iiu_newpc_req;
171:
172: wire [DATA_WIDTH:0]
173:     iq_in_data = imem_data;
174: wire iq_in_req = imem_out_req,
175:     iq_out_ack = iiu_inst_ack,
176:     iq_en_out = iiu_inst_en,
177:     iq_cancel_req = iiu_newpc_req;
178:
179: wire iiu_inst_req = iq_out_req,
180:     iiu_log_ack = log_ack,
181:     iiu_cki_ack = cki_in_ack,
182:     iiu_pc_ack = pc_load_ack,
183:     iiu_imem_ack = imem_cancel_ack,
184:     iiu_iq_ack = iq_cancel_ack,
185:     iiu_res_ack = restable_res_ack,
186:     iiu_rfile_ack = regdwb_rd_ack,
187:     iiu_ckf_ack = ckf_in_ack,
188:     iiu_alu_ack = alu_comp_ack,
189:     iiu_mem_ack = memdwb_in_ack;
190:
191: wire [SEQ_WIDTH-1:0]
192:     restable_seq_num = iiu_seq_num;
193: wire [REG_WIDTH-1:0]
194:     restable_reg_w = iiu_reg_rd,
195:     restable_reg_r1 = iiu_reg_rs,
196:     restable_reg_r2 = iiu_reg_rt,
197:     restable_reg_clr = regdwb_out_reg;
198: wire restable_res_req = iiu_res_req,
199:     restable_clr_req = regdwb_clr_req;
200:
201: wire [SEQ_WIDTH-1:0]
202:     alu_in_seq = iiu_seq_num;
203: wire [REG_WIDTH-1:0]
204:     alu_in_dest = iiu_reg_rd;
205: wire [DATA_WIDTH:0]
206:     alu_in_d1 = a_bus,
207:     alu_in_d2 = b_bus;
208: wire [FUNC_WIDTH-1:0]
209:     alu_in_func = iiu_alu_func;
210: wire alu_comp_req = iiu_alu_req,
211:     alu_arb_ack = arbr_ack0,
212:     alu_out_ack = regdwb_in_ack;
213:


214: wire [SEQ_WIDTH-1:0]
215:     memdwb_in_seq = iiu_seq_num;
216: wire [REG_WIDTH-1:0]
217:     memdwb_in_reg = iiu_reg_rd;
218: wire [DATA_WIDTH:0]
219:     memdwb_in_addr = a_bus,
220:     memdwb_in_data = b_bus;
221: wire memdwb_in_rw_mode = iiu_mem_rw,
222:     memdwb_in_retry = iiu_retry,
223:     memdwb_in_req = iiu_mem_req,
224:     memdwb_chk_ack = ckm_in_ack,
225:     memdwb_val_req = log_valmem_req,
226:     memdwb_out_ack = dmem_in_ack,
227:     memdwb_dq_ack = dq_in_ack;
228:
229: wire [ADDR_WIDTH:0]
230:     dmem_addr = memdwb_out_addr;
231: wire dmem_rw_mode = memdwb_rw_mode,
232:     dmem_retry = memdwb_out_retry,
233:     dmem_in_req = memdwb_out_req,
234:     dmem_out_ack = dq_in_ack;
235:
236: wire [SEQ_WIDTH-1:0]
237:     dq_in_seq = memdwb_out_seq;
238: wire [REG_WIDTH-1:0]
239:     dq_in_reg = memdwb_out_reg;
240: wire [DATA_WIDTH:0]
241:     dq_in_data = d_bus;
242: wire dq_in_req = tri_dq_req,
243:     dq_arb_ack = arbr_ack1,
244:     dq_out_ack = regdwb_in_ack;
245:
246: wire arbr_req0 = alu_arb_req,
247:     arbr_req1 = dq_arb_req;
248:
249: wire [SEQ_WIDTH-1:0]
250:     regdwb_in_seq = r_bus_seq;
251: wire [REG_WIDTH-1:0]
252:     regdwb_in_reg = r_bus_dest,
253:     regdwb_rd_reg1 = iiu_reg_rs,
254:     regdwb_rd_reg2 = iiu_reg_rt;
255: wire [DATA_WIDTH:0]
256:     regdwb_in_data = r_bus_data,
257:     regdwb_rf_data1 = rfile_rd_data1,
258:     regdwb_rf_data2 = rfile_rd_data2;
259: wire regdwb_sim_f = iiu_sim_f;
260: wire regdwb_in_req = r_bus_req,
261:     regdwb_chk_ack = ckr_in_ack,
262:     regdwb_clr_ack = restable_clr_ack,
263:     regdwb_val_req = log_valreg_req,
264:     regdwb_rd_req = iiu_rfile_req,
265:     regdwb_rdf_ack = rfile_rd_ack,
266:     regdwb_wrf_ack = rfile_wr_ack;
267:
268: wire [REG_WIDTH-1:0]
269:     rfile_wr_reg = regdwb_wrf_reg,
270:     rfile_rd_reg1 = iiu_reg_rs,
271:     rfile_rd_reg2 = iiu_reg_rt;
272: wire [DATA_WIDTH:0]
273:     rfile_wr_data = regdwb_wrf_data;
274: wire rfile_wr_req = regdwb_wrf_req,
275:     rfile_rd_req = regdwb_rdf_req;
276:
277: wire [SEQ_WIDTH-1:0]
278:     log_seq = iiu_seq_num,
279:     log_chk_seq = k_bus_seq;
280: wire [CHECKERS-1:0]
281:     log_chkbits = iiu_chkbits;
282: wire [CHKID_WIDTH-1:0]
283:     log_chk_chkid = k_bus_chkid;
284: wire log_chk_error = k_bus_error;


285: wire log_req = iiu_log_req,
286:     log_chk_req = k_bus_req,
287:     log_valmem_ack = memdwb_val_ack,
288:     log_valreg_ack = regdwb_val_ack,
289:     log_halt_ack = ch_out,
290:     log_roll_ack = cr_out;
291:
292: wire [SEQ_WIDTH-1:0]
293:     cki_in_seq = iiu_seq_num;
294: wire [DATA_WIDTH:0]
295:     cki_in_data = b_bus;
296: wire cki_in_req = iiu_cki_req,
297:     cki_arb_ack = arbk_ack0,
298:     cki_out_ack = log_chk_ack;
299:
300: wire [SEQ_WIDTH-1:0]
301:     ckf_in_seq = iiu_seq_num;
302: wire [DATA_WIDTH:0]
303:     ckf_in_data1 = a_bus,
304:     ckf_in_data2 = b_bus;
305: wire ckf_in_req = iiu_ckf_req,
306:     ckf_arb_ack = arbk_ack1,
307:     ckf_out_ack = log_chk_ack;
308:
309: wire [SEQ_WIDTH-1:0]
310:     ckm_in_seq = iiu_seq_num;
311: wire [DATA_WIDTH:0]
312:     ckm_in_data1 = a_bus,
313:     ckm_in_data2 = b_bus;
314: wire ckm_in_req = memdwb_chk_req,
315:     ckm_arb_ack = arbk_ack2,
316:     ckm_out_ack = log_chk_ack;
317:
318: wire [SEQ_WIDTH-1:0]
319:     ckr_in_seq = regdwb_out_seq;
320: wire [DATA_WIDTH:0]
321:     ckr_in_data = regdwb_out_data;
322: wire ckr_in_req = regdwb_chk_req,
323:     ckr_arb_ack = arbk_ack3,
324:     ckr_out_ack = log_chk_ack;
325:
326: wire arbk_req0 = cki_arb_req,
327:     arbk_req1 = ckf_arb_req,
328:     arbk_req2 = ckm_arb_req,
329:     arbk_req3 = ckr_arb_req;
330:
331: // ====================================================================
332: // components
333:
334: pc pc (pc_load, pc_in_retry, pc_load_req, pc_load_ack, pc_out, pc_out_retry,
335:     pc_out_req, pc_out_ack, reset);
336:
337: imem imem (imem_addr, imem_retry, imem_in_req, imem_in_ack, imem_data,
338:     imem_out_req, imem_out_ack, imem_cancel_req, imem_cancel_ack, reset);
339:
340: iq iq (iq_in_data, iq_in_req, iq_in_ack, i_bus, iq_out_req,
341:     iq_out_ack, iq_en_out, iq_cancel_req, iq_cancel_ack, reset);
342:
343: iiu iiu (i_bus, iiu_inst_req, iiu_inst_ack, iiu_inst_en,
344:     iiu_chkbits, iiu_log_req, iiu_log_ack, iiu_cki_req, iiu_cki_ack,
345:     iiu_newpc_req, iiu_pc_ack, iiu_imem_ack, iiu_iq_ack, iiu_seq_num,
346:     iiu_reg_rd, iiu_reg_rs, iiu_reg_rt, iiu_res_req, iiu_res_ack,
347:     iiu_sim_f, a_bus, b_bus, iiu_rfile_req, iiu_rfile_ack,
348:     iiu_ckf_req, iiu_ckf_ack, iiu_alu_func,
349:     iiu_alu_req, iiu_alu_ack, iiu_mem_rw, iiu_mem_req, iiu_mem_ack,
350:     val_seq, halt_req, iiu_halt_ack, roll_req, iiu_roll_ack,
351:     iiu_retry, stop, reset);
352:
353: restable restable (restable_seq_num, restable_reg_w, restable_reg_r1,
354:     restable_reg_r2, restable_res_req, restable_res_ack,
355:     restable_reg_clr, restable_clr_req, restable_clr_ack,

      val_seq, halt_req, restable_halt_ack, roll_req, restable_roll_ack,
      reset);

  alu alu (alu_in_seq, alu_in_dest, alu_in_d1, alu_in_d2, alu_in_func,
      alu_comp_req, alu_comp_ack, alu_arb_req, alu_arb_ack, r_bus_seq,
      r_bus_dest, r_bus_data, r_bus_req, alu_out_ack,
      val_seq, halt_req, alu_halt_ack, roll_req, alu_roll_ack, reset);

  memdwb memdwb (memdwb_in_seq, memdwb_in_reg, memdwb_in_addr, memdwb_in_data,
      memdwb_in_rw_mode, memdwb_in_retry, memdwb_in_req, memdwb_in_ack,
      memdwb_chk_req, memdwb_chk_ack, val_seq, memdwb_val_req,
      memdwb_val_ack, memdwb_out_addr, d_bus, memdwb_rw_mode,
      memdwb_out_retry, memdwb_out_req, memdwb_out_ack, memdwb_out_seq,
      memdwb_out_reg, tri_dq_req, memdwb_dq_ack,
      halt_req, memdwb_halt_ack, roll_req, memdwb_roll_ack, reset);

  dmem dmem (dmem_addr, d_bus, dmem_rw_mode, dmem_retry, dmem_in_req,
      dmem_in_ack, tri_dq_req, dmem_out_ack,
      halt_req, dmem_halt_ack, roll_req, dmem_roll_ack, reset);

  dq dq (dq_in_seq, dq_in_reg, dq_in_data, dq_in_req, dq_in_ack, dq_arb_req,
      dq_arb_ack, r_bus_seq, r_bus_dest, r_bus_data, r_bus_req, dq_out_ack,
      val_seq, halt_req, dq_halt_ack, roll_req, dq_roll_ack, reset);

  arbr arbr (arbr_req0, arbr_ack0, arbr_req1, arbr_ack1,
      halt_req, arbr_halt_ack, roll_req, arbr_roll_ack, reset);

  regdwb regdwb (regdwb_in_seq, regdwb_in_reg, regdwb_in_data, regdwb_in_req,
      regdwb_in_ack, regdwb_out_seq, regdwb_out_reg, regdwb_out_data,
      regdwb_chk_req, regdwb_chk_ack, regdwb_clr_req, regdwb_clr_ack,
      val_seq, regdwb_val_req, regdwb_val_ack,
      regdwb_rd_reg1, regdwb_rd_reg2, regdwb_sim_f, a_bus, b_bus,
      regdwb_rd_req, regdwb_rd_ack, regdwb_rf_data1, regdwb_rf_data2,
      regdwb_rdf_req, regdwb_rdf_ack, regdwb_wrf_reg, regdwb_wrf_data,
      regdwb_wrf_req, regdwb_wrf_ack,
      halt_req, regdwb_halt_ack, roll_req, regdwb_roll_ack, reset);

  regfile regfile (rfile_wr_reg, rfile_wr_data, rfile_wr_req, rfile_wr_ack,
      rfile_rd_reg1, rfile_rd_reg2, rfile_rd_data1, rfile_rd_data2,
      rfile_rd_req, rfile_rd_ack,
      halt_req, rfile_halt_ack, roll_req, rfile_roll_ack, reset);

  log log (log_seq, log_chkbits, a_bus, log_req, log_ack, log_chk_seq,
      log_chk_chkid, log_chk_error, log_chk_req, log_chk_ack,
      val_seq, log_valmem_req, log_valmem_ack, log_valreg_req,
      log_valreg_ack, halt_req, log_halt_ack, roll_req,
      log_roll_ack, reset);

  cki cki (cki_in_seq, cki_in_data, cki_in_req, cki_in_ack, cki_arb_req,
      cki_arb_ack, k_bus_seq, k_bus_chkid, k_bus_error, k_bus_req,
      cki_out_ack, val_seq, halt_req, cki_halt_ack, roll_req, cki_roll_ack,
      reset);

  ckf ckf (ckf_in_seq, ckf_in_data1, ckf_in_data2, ckf_in_req, ckf_in_ack,
      ckf_arb_req, ckf_arb_ack, k_bus_seq, k_bus_chkid, k_bus_error,
      k_bus_req, ckf_out_ack, val_seq, halt_req, ckf_halt_ack, roll_req,
      ckf_roll_ack, reset);

  ckm ckm (ckm_in_seq, ckm_in_data1, ckm_in_data2, ckm_in_req, ckm_in_ack,
      ckm_arb_req, ckm_arb_ack, k_bus_seq, k_bus_chkid, k_bus_error,
      k_bus_req, ckm_out_ack, val_seq, halt_req, ckm_halt_ack, roll_req,
      ckm_roll_ack, reset);

  ckr ckr (ckr_in_seq, ckr_in_data, ckr_in_req, ckr_in_ack, ckr_arb_req,
      ckr_arb_ack, k_bus_seq, k_bus_chkid, k_bus_error, k_bus_req,
      ckr_out_ack, val_seq, halt_req, ckr_halt_ack, roll_req, ckr_roll_ack,
      reset);

  arbk arbk (arbk_req0, arbk_ack0, arbk_req1, arbk_ack1,
      arbk_req2, arbk_ack2, arbk_req3, arbk_ack3,
      halt_req, arbk_halt_ack, roll_req, arbk_roll_ack, reset);

  bigc c_halt (iiu_halt_ack, restable_halt_ack, alu_halt_ack, memdwb_halt_ack,
      dmem_halt_ack, dq_halt_ack, regdwb_halt_ack, rfile_halt_ack,
      cki_halt_ack, ckf_halt_ack, ckm_halt_ack, ckr_halt_ack,
      arbr_halt_ack, arbk_halt_ack, ch_out, reset);

  bigc c_roll (iiu_roll_ack, restable_roll_ack, alu_roll_ack, memdwb_roll_ack,
      dmem_roll_ack, dq_roll_ack, regdwb_roll_ack, rfile_roll_ack,
      cki_roll_ack, ckf_roll_ack, ckm_roll_ack, ckr_roll_ack,
      arbr_roll_ack, arbk_roll_ack, cr_out, reset);

  // ====================================================================
  // test setup

  initial
  begin
    $freeze_waves;                    // saves simulation time
    $gr_waves_memsize (1024 * 1024);
    $gr_position ("waves", 50, 0, 1050, 850);

    $gr_waves (
      "pc_load", pc_load_req,         // instruction fetch
      "pc_out", pc_out,
      "im_req", pc_out_req,
      "im_ack", pc_out_ack,
      "im_inst", imem_data,
      "iq_req", imem_out_req,
      "iq_ack", imem_out_ack,
      "I_bus", i_bus,
      "iiu_req", iiu_inst_req,
      "iiu_ack", iiu_inst_ack,

      "seq_num", iiu_seq_num,         // instruction preparation
      "reg_rs", iiu_reg_rs,
      "reg_rt", iiu_reg_rt,
      "reg_rd", iiu_reg_rd,
      "log_req", iiu_log_req,
      "log_ack", iiu_log_ack,
      "cki_req", iiu_cki_req,
      "res_req", iiu_res_req,
      "rf_req", iiu_rfile_req,

      "A_bus", a_bus,                 // instruction issue & exec
      "A_bus_L", a_bus [ADDR_WIDTH:0],
      "B_bus", b_bus,
      "func", iiu_alu_func,
      "alu_req", iiu_alu_req,
      "mem_req", iiu_mem_req,

      "dmem_rq", memdwb_out_req,
      "dq_req", dq_in_req,

      "R_seq", r_bus_seq,             // register file access
      "R_data", r_bus_data,
      "R_dest", r_bus_dest,
      "alu_arb", alu_arb_req,
      "dq_arb", dq_arb_req,

      "clr_req", regdwb_clr_req,
      "wrf_reg", rfile_wr_reg,
      "wrf_req", rfile_wr_req,

      "cki_arb", cki_arb_req,         // error detection
      "ckf_arb", ckf_arb_req,
      "ckm_arb", ckm_arb_req,
      "ckr_arb", ckr_arb_req,
      "chk_ack", log_chk_ack,
      "K_seq", k_bus_seq,
      "K_chkid", k_bus_chkid,
      "K_err", k_bus_error,

      "val_seq", val_seq,             // validation & rollback
      "val_mem", log_valmem_req,
      "val_reg", log_valreg_req,
      "hlt_req", halt_req,
      "hlt_ack", log_halt_ack,
      "rol_req", roll_req,
      "rol_ack", log_roll_ack);

    $define_group_waves (1, "tst1a",
      "pc_load", "pc_out", "im_req", "im_ack", "im_inst",
      "iq_req", "iq_ack", "I_bus", "iiu_req", "iiu_ack", "seq_num");

    $define_group_waves (2, "tst1b",
      "pc_load", "pc_out", "im_req", "im_inst", "iq_req", "I_bus",
      "iiu_req", "iiu_ack", "seq_num", "reg_rs", "reg_rt", "reg_rd",
      "log_req", "cki_req", "res_req", "rf_req", "func", "alu_req",
      "R_seq", "R_data", "R_dest", "alu_arb", "clr_req", "wrf_reg",
      "wrf_req", "K_seq", "K_chkid", "val_seq", "val_reg");

    $define_group_waves (3, "tst2a",
      "I_bus", "iiu_ack", "seq_num", "reg_rs", "reg_rt", "reg_rd",
      "log_req", "res_req", "rf_req", "alu_req", "mem_req", "dmem_rq",
      "dq_req", "R_seq", "R_data", "R_dest", "alu_arb", "dq_arb",
      "clr_req", "wrf_reg", "wrf_req", "val_seq", "val_reg", "val_mem");

    $define_group_waves (4, "tst2b",
      "seq_num", "log_req", "log_ack", "K_seq", "cki_arb", "ckf_arb",
      "ckm_arb", "ckr_arb", "chk_ack");

    $define_group_waves (5, "tst3a",
      "I_bus", "iiu_ack", "seq_num", "reg_rd", "log_req", "res_req",
      "rf_req", "alu_req", "mem_req", "dmem_rq", "dq_req", "R_seq",
      "R_data", "R_dest", "alu_arb", "dq_arb", "clr_req", "wrf_reg",
      "wrf_req", "val_seq", "val_reg");

    $define_group_waves (6, "tst4a",
      "pc_load", "pc_out", "I_bus", "iiu_req", "iiu_ack", "seq_num",
      "reg_rd", "log_req", "res_req", "rf_req", "func", "alu_req",
      "mem_req", "dmem_rq", "dq_req", "R_seq", "R_data", "R_dest",
      "alu_arb", "dq_arb", "clr_req", "wrf_reg", "wrf_req", "K_seq",
      "K_chkid", "K_err", "val_seq", "val_reg", "val_mem", "hlt_req");

    $define_group_waves (7, "tst4b",
      "pc_load", "pc_out", "I_bus", "iiu_req", "iiu_ack", "seq_num",
      "reg_rd", "log_req", "res_req", "rf_req", "func", "alu_req",
      "R_seq", "R_data", "R_dest", "alu_arb", "clr_req", "wrf_reg",
      "wrf_req", "K_seq", "K_chkid", "K_err", "val_seq", "val_reg",
      "hlt_req");

    $define_group_waves (8, "tst4c",
      "pc_load", "pc_out", "I_bus", "iiu_req", "iiu_ack", "seq_num",
      "reg_rd", "log_req", "res_req", "rf_req", "K_seq", "K_chkid",
      "K_err", "val_seq", "hlt_req");

    $define_group_waves (9, "tst4d",
      "pc_load", "pc_out", "I_bus", "iiu_req", "iiu_ack", "seq_num",
      "reg_rd", "log_req", "res_req", "rf_req", "func", "alu_req",
      "R_seq", "R_data", "R_dest", "alu_arb", "clr_req", "wrf_reg",
      "wrf_req", "K_seq", "K_chkid", "K_err", "val_seq", "val_reg",
      "hlt_req");

    $define_group_waves (10, "tst4e",
      "pc_load", "pc_out", "I_bus", "iiu_req", "iiu_ack", "seq_num",
      "reg_rd", "log_req", "res_req", "rf_req", "mem_req", "dmem_rq",
      "K_seq", "K_chkid", "K_err", "val_seq", "val_mem", "hlt_req");

    $define_group_waves (11, "tst4f",
      "pc_load", "pc_out", "I_bus", "iiu_req", "iiu_ack", "seq_num",
      "reg_rd", "log_req", "res_req", "rf_req", "mem_req", "dmem_rq",
      "dq_req", "R_seq", "R_data", "R_dest", "dq_arb", "clr_req",
      "wrf_req", "K_seq", "K_chkid", "K_err", "val_seq", "val_reg",
      "hlt_req");

    $define_group_waves (12, "tst4g",
      "val_seq", "hlt_req", "hlt_ack", "rol_req", "rol_ack", "seq_num",
      "A_bus_L", "B_bus", "log_req", "cki_req", "pc_load", "pc_out");

    $define_group_waves (13, "tst5a",
      "pc_load", "pc_out", "I_bus", "iiu_req", "iiu_ack", "seq_num",
      "log_req", "res_req", "rf_req", "func", "alu_req", "mem_req",
      "dmem_rq", "dq_req", "R_seq", "R_dest", "alu_arb", "dq_arb",
      "clr_req", "wrf_reg", "wrf_req", "cki_arb", "ckf_arb", "ckm_arb",
      "ckr_arb", "K_seq", "K_chkid", "K_err", "val_seq", "val_reg",
      "val_mem", "hlt_req");

    $define_group_waves (20, "all",
      "pc_load", "pc_out", "im_req", "im_ack", "im_inst", "iq_req",
      "iq_ack", "I_bus", "iiu_req", "iiu_ack", "seq_num", "reg_rs",
      "reg_rt", "reg_rd", "log_req", "log_ack", "cki_req", "res_req",
      "rf_req", "A_bus", "A_bus_L", "B_bus", "func", "alu_req", "mem_req",
      "dmem_rq", "dq_req", "R_seq", "R_data", "R_dest", "alu_arb",
      "dq_arb", "clr_req", "wrf_reg", "wrf_req", "cki_arb", "ckf_arb",
      "ckm_arb", "ckr_arb", "chk_ack", "K_seq", "K_chkid", "K_err",
      "val_seq", "val_mem", "val_reg", "hlt_req", "hlt_ack", "rol_req",
      "rol_ack");

    #1;                               // for #0 $gr_addwaves
    $define_group_waves (15, "arbr",
      "R_seq", "alu_arb", "dq_arb", "none",
      "req0", "v0", "ack0", "req1", "v1", "ack1");

    $define_group_waves (16, "iq",
      "pc_load", "pc_out", "none", "in_d", "out_d", "in_r", "in_a",
      "out_r", "out_a", "can_r", "can_a", "none", "latch", "osc", "q1",
      "q2", "same1", "same2", "done", "regclk", "complet");

    $define_group_waves (17, "dq",
      "in_r", "cout1", "latch", "in_a", "a", "cout2", "out_r", "out_a",
      "h_req", "h_ack", "r_req", "clear1", "clear2", "r_ack", "none",
      "seq", "error", "req", "_a", "na", "b", "nb", "c", "nc", "d",
      "inv", "ack");
  end

  initial
  begin
    reset = 1;
    #RESET_TIME;
    reset = 0;

    fork
      begin                           // normal termination
        wait (stop);
        #400 $stop;
      end
      begin                           // abnormal termination
        #5000;
        $display ("AMPIRE: exceeded maximum simulation time.");
        $stop;
      end
    join
  end

endmodule // ampire

`include "pc.v"
`include "imem.v"
`include "iq.v"
`include "iiu.v"
`include "restable.v"
`include "alu.v"
`include "memdwb.v"
`include "dmem.v"
`include "dq.v"
`include "arbr.v"
`include "regdwb.v"
`include "regfile.v"
`include "cki.v"
`include "ckf.v"
`include "ckm.v"
`include "ckr.v"
`include "arbk.v"
`include "log.v"
`include "bigc.v"
`include "gates.v"

alu.v

// ALU (combining ALUQ and ALUF)

module alu (in_seq, in_dest, in_d1, in_d2, in_func, comp_req, comp_ack,
    arb_req, arb_ack, tri_seq, tri_dest, tri_data, tri_req, out_ack,
    val_seq, halt_req, halt_ack, roll_req, roll_ack, reset);

  `include "parameter"

  input [SEQ_WIDTH-1:0] in_seq, val_seq;
  input [REG_WIDTH-1:0] in_dest;
  input [DATA_WIDTH:0] in_d1, in_d2;
  input [FUNC_WIDTH-1:0] in_func;
  output [SEQ_WIDTH-1:0] tri_seq;
  output [REG_WIDTH-1:0] tri_dest;
  output [DATA_WIDTH:0] tri_data;
  input comp_req, arb_ack, out_ack, halt_req, roll_req, reset;
  output comp_ack, arb_req, tri_req, halt_ack, roll_ack;

  reg halt_ack, roll_ack;

  wire [SEQ_WIDTH-1:0] mid_seq;
  wire [REG_WIDTH-1:0] mid_dest;
  wire [DATA_WIDTH:0] mid_d1, mid_d2;
  wire [FUNC_WIDTH-1:0] mid_func;

  aluq aluq (in_seq, in_dest, in_d1, in_d2, in_func, comp_req, comp_ack,
      mid_seq, mid_dest, mid_d1, mid_d2, mid_func, mid_req, mid_ack,
      val_seq, halt_req, haltq_ack, roll_req, rollq_ack, reset);
  aluf aluf (mid_seq, mid_dest, mid_d1, mid_d2, mid_func, mid_req, mid_ack,
      arb_req, arb_ack, tri_seq, tri_dest, tri_data, tri_req, out_ack,
      val_seq, halt_req, haltf_ack, roll_req, rollf_ack, reset);

  always wait (reset)
  begin
    halt_ack = 0;
    roll_ack = 0;
    wait (~reset);
  end

  always @(haltq_ack or haltf_ack)
    if (haltq_ack & haltf_ack)
      halt_ack = 1;
    else if (~haltq_ack & ~haltf_ack)
      halt_ack = 0;

  always @(rollq_ack or rollf_ack)
    if (rollq_ack & rollf_ack)
      roll_ack = 1;
    else if (~rollq_ack & ~rollf_ack)
      roll_ack = 0;

endmodule // alu

// ====================================================================
// ALUF (functional part)

module aluf (in_seq, in_dest, in_d1, in_d2, in_func, comp_req, comp_ack,
    arb_req, arb_ack, tri_seq, tri_dest, tri_data, tri_req, out_ack,
    val_seq, halt_req, halt_ack, roll_req, roll_ack, reset);

  `include "parameter"

  input [SEQ_WIDTH-1:0] in_seq, val_seq;
  input [REG_WIDTH-1:0] in_dest;
  input [DATA_WIDTH:0] in_d1, in_d2;
  input [FUNC_WIDTH-1:0] in_func;
  output [SEQ_WIDTH-1:0] tri_seq;
  output [REG_WIDTH-1:0] tri_dest;
  output [DATA_WIDTH:0] tri_data;
  input comp_req, arb_ack, out_ack, halt_req, roll_req, reset;
  output comp_ack, arb_req, tri_req, halt_ack, roll_ack;

  reg [SEQ_WIDTH-1:0] out_seq;
  reg [REG_WIDTH-1:0] out_dest;
  reg [DATA_WIDTH:0] out_data;
  reg comp_ack, arb_req, out_req, halt_ack, roll_ack;

  reg [DATA_WIDTH-1:0] data1, data2;
  reg [FUNC_WIDTH-1:0] func;
  reg [SEQ_WIDTH-1:0] diff;
  reg [5:0] loop;                     // DATA_WIDTH = 32 bits max
  reg valid, parity;

  assign tri_seq = arb_ack ? out_seq : ZZZ;
  assign tri_dest = arb_ack ? out_dest : ZZZ;
  assign tri_data = arb_ack ? out_data : ZZZ;
  assign tri_req = arb_ack ? out_req : ZZZ;

  always wait (reset)
  begin
    disable rollback_cycle;
    disable latch_cycle;
    disable output_cycle;
    out_seq = XXX;
    out_dest = XXX;
    out_data = XXX;
    comp_ack = 0;
    arb_req = 0;
    out_req = 0;
    halt_ack = 0;
    roll_ack = 0;
    valid = 0;
    wait (~reset);
  end

  always wait (halt_req & ~reset)
  begin :rollback_cycle
    #1;
    halt_ack = 1;
    wait (roll_req);
    #1;
    disable latch_cycle;
    disable output_cycle;
    comp_ack = 0;
    arb_req = 0;
    out_req = 0;

    #DLY_SEQ_COMP;
    diff = out_seq - val_seq;         // compare sequence numbers
    if (~diff [SEQ_WIDTH-1])
    begin
      valid = 0;
      out_seq = XXX;
      out_dest = XXX;
      out_data = XXX;
    end

    roll_ack = 1;
    fork
      begin
        wait (~roll_req);
        #1;
        roll_ack = 0;
      end
      begin
        wait (~halt_req);
        #1;
        halt_ack = 0;
      end
    join
  end

  always wait (comp_req & ~valid & ~halt_req & ~reset)
  begin :latch_cycle
    #1;
    wait (~halt_req);
    data1 = in_d1 [DATA_WIDTH-1:0];
    data2 = in_d2 [DATA_WIDTH-1:0];
    func = in_func;
    out_seq = in_seq;
    out_dest = in_dest;
    comp_ack = 1;
    valid = 1;
    wait (~comp_req);
    #1;
    wait (~halt_req);
    comp_ack = 0;
  end

  always wait (valid & ~halt_req & ~reset)
  begin :output_cycle
    #1;
    compute;
    arb_req = 1;
    wait (arb_ack);
    #1;
    wait (~halt_req);
    out_req = 1;
    wait (out_ack);
    #1;
    valid = 0;
    arb_req = 0;
    wait (~halt_req);
    out_req = 0;
    wait (~out_ack & ~arb_ack);
  end

  task compute;
  begin
    case (func)
      FUNC_ADDU, FUNC_ADDUF:
                 #DLY_ALU_ADD out_data = data1 + data2;
      FUNC_SUBU: #DLY_ALU_SUB out_data = data1 - data2;

      FUNC_AND:  #DLY_ALU_AND out_data = data1 & data2;
      FUNC_OR:   #DLY_ALU_OR out_data = data1 | data2;
      FUNC_XOR:  #DLY_ALU_XOR out_data = data1 ^ data2;
      FUNC_PASS: #DLY_ALU_PASS out_data = data2;
      FUNC_LHI:  #DLY_ALU_SHIFT out_data = data2 << IMM_WIDTH;
      FUNC_SLL:  #DLY_ALU_SHIFT out_data = data1 << (data2 % DATA_WIDTH);
      FUNC_SRL:  #DLY_ALU_SHIFT out_data = data1 >> (data2 % DATA_WIDTH);
      FUNC_SRA:  #DLY_ALU_SHIFT
                 begin
                   // start from data1 and shift out_data right one bit per
                   // iteration, replicating the sign bit each time
                   out_data = data1;
                   for (loop=0; loop<(data2 % DATA_WIDTH); loop=loop+1)
                     out_data = {out_data[DATA_WIDTH-1], out_data[DATA_WIDTH-1:1]};
                 end

      FUNC_SLT:  #DLY_ALU_COMP out_data = (data1 < data2);
      FUNC_SGT:  #DLY_ALU_COMP out_data = (data1 > data2);
      FUNC_SLE:  #DLY_ALU_COMP out_data = (data1 <= data2);
      FUNC_SGE:  #DLY_ALU_COMP out_data = (data1 >= data2);
      FUNC_SEQ:  #DLY_ALU_COMP out_data = (data1 == data2);
      FUNC_SNE:  #DLY_ALU_COMP out_data = (data1 != data2);

      default: out_data = XXX;
    endcase

    parity = (func == FUNC_ADDUF);    // FUNC_ADDUF: deliberately bad parity
    for (loop=0; loop<DATA_WIDTH; loop=loop+1)   // calculate parity
      parity = parity ^ out_data [loop];
    out_data [DATA_WIDTH] = parity;
  end
  endtask

endmodule // aluf

// ====================================================================
// ALU Input Queue

module aluq (in_seq, in_dest, in_d1, in_d2, in_func, in_req, in_ack,
    out_seq, out_dest, out_d1, out_d2, out_func, out_req, out_ack,
    val_seq, halt_req, halt_ack, roll_req, roll_ack, reset);

  `include "parameter"

  input [SEQ_WIDTH-1:0] in_seq, val_seq;
  input [REG_WIDTH-1:0] in_dest;
  input [DATA_WIDTH:0] in_d1, in_d2;
  input [FUNC_WIDTH-1:0] in_func;
  output [SEQ_WIDTH-1:0] out_seq;
  output [REG_WIDTH-1:0] out_dest;
  output [DATA_WIDTH:0] out_d1, out_d2;
  output [FUNC_WIDTH-1:0] out_func;
  input in_req, out_ack, halt_req, roll_req, reset;
  output in_ack, out_req, halt_ack, roll_ack;

  reg halt_ack, roll_ack;

  wire [SEQ_WIDTH-1:0] seq1;
  wire [REG_WIDTH-1:0] dest1;
  wire [DATA_WIDTH:0] d1_1, d2_1;
  wire [FUNC_WIDTH-1:0] func1;

  aluq_buffer buf1 (in_seq, in_dest, in_d1, in_d2, in_func, in_req, in_ack,
      seq1, dest1, d1_1, d2_1, func1, req1, ack1,
      val_seq, halt_req, halt_a1, roll_req, roll_a1, reset);
  aluq_buffer buf2 (seq1, dest1, d1_1, d2_1, func1, req1, ack1, out_seq,
      out_dest, out_d1, out_d2, out_func, out_req, out_ack,
      val_seq, halt_req, halt_a2, roll_req, roll_a2, reset);

  always wait (reset)
  begin
    halt_ack = 0;
    roll_ack = 0;
    wait (~reset);
  end

  always @(halt_a1 or halt_a2)
    if (halt_a1 & halt_a2)
      halt_ack = 1;
    else if (~halt_a1 & ~halt_a2)
      halt_ack = 0;

  always @(roll_a1 or roll_a2)
    if (roll_a1 & roll_a2)
      roll_ack = 1;
    else if (~roll_a1 & ~roll_a2)
      roll_ack = 0;

endmodule // aluq

// ====================================================================

module aluq_buffer (in_s, in_ds, in_d1, in_d2, in_f, in_r, in_a,
    out_s, out_ds, out_d1, out_d2, out_f, out_r, out_a,
    val_seq, halt_req, halt_ack, roll_req, roll_ack, reset);

  `include "parameter"

  input [SEQ_WIDTH-1:0] in_s, val_seq;
  input [REG_WIDTH-1:0] in_ds;
  input [DATA_WIDTH:0] in_d1, in_d2;
  input [FUNC_WIDTH-1:0] in_f;
  output [SEQ_WIDTH-1:0] out_s;
  output [REG_WIDTH-1:0] out_ds;
  output [DATA_WIDTH:0] out_d1, out_d2;
  output [FUNC_WIDTH-1:0] out_f;
  input in_r, out_a, halt_req, roll_req, reset;
  output in_a, out_r, halt_ack, roll_ack;

  reg [SEQ_WIDTH-1:0] out_s;
  reg [REG_WIDTH-1:0] out_ds;
  reg [DATA_WIDTH:0] out_d1, out_d2;
  reg [FUNC_WIDTH-1:0] out_f;
  reg in_a, out_r, halt_ack, roll_ack;

  reg [SEQ_WIDTH-1:0] diff;
  reg valid;

  always wait (reset)
  begin
    disable rollback_cycle;
    disable input_cycle;
    disable output_cycle;
    out_s = XXX;
    out_ds = XXX;
    out_d1 = XXX;
    out_d2 = XXX;
    out_f = XXX;
    in_a = 0;
    out_r = 0;
    halt_ack = 0;
    roll_ack = 0;
    valid = 0;
    wait (~reset);
  end

  always wait (halt_req & ~reset)
  begin :rollback_cycle
    #1;
    halt_ack = 1;
    wait (roll_req);
    #1;
    disable input_cycle;
    disable output_cycle;
    in_a = 0;
    out_r = 0;

    #DLY_SEQ_COMP;
    diff = out_s - val_seq;           // compare sequence numbers
    if (~diff [SEQ_WIDTH-1])
    begin
      valid = 0;
      out_s = XXX;
      out_ds = XXX;
      out_d1 = XXX;
      out_d2 = XXX;
      out_f = XXX;
    end

    roll_ack = 1;
    fork
      begin
        wait (~roll_req);
        #1;
        roll_ack = 0;
      end
      begin
        wait (~halt_req);
        #1;
        halt_ack = 0;
      end
    join
  end

  always wait (in_r & ~valid & ~halt_req & ~reset)
  begin :input_cycle
    #1;
    wait (~halt_req);
    out_s = in_s;
    out_ds = in_ds;
    out_d1 = in_d1;
    out_d2 = in_d2;
    out_f = in_f;
    in_a = 1;
    valid = 1;
    wait (~in_r);
    #1;
    wait (~halt_req);
    in_a = 0;
  end

  always wait (valid & ~halt_req & ~reset)
  begin :output_cycle
    #1;
    wait (~halt_req);
    out_r = 1;
    wait (out_a);
    #1;
    valid = 0;
    wait (~halt_req);
    out_r = 0;
    wait (~out_a);
  end

endmodule // aluq_buffer
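The rollback_cycle blocks in aluf and aluq_buffer all decide which entries to discard with the same test. Each subtracts val_seq from the entry's sequence number and invalidates the entry when the sign bit of the difference is 0, that is, when the entry is at or after val_seq. The stand-alone sketch below isolates that test as a function so that its wrap-around behavior can be checked directly.

The sketch makes two assumptions:
- The sequence number is 4 bits wide. The real SEQ_WIDTH is defined in the "parameter" file, which is not listed here.
- The module and function names are hypothetical.

```verilog
// Hypothetical sketch of the rollback test used by the queue and
// functional stages: discard an entry when (entry_seq - val_seq) is
// non-negative modulo 2**SEQ_W.
module seq_cmp_demo;
  parameter SEQ_W = 4;   // assumed width, standing in for SEQ_WIDTH

  // Returns 1 if the entry must be invalidated on rollback.
  function discard;
    input [SEQ_W-1:0] entry_seq, val_seq;
    reg   [SEQ_W-1:0] diff;
    begin
      diff    = entry_seq - val_seq;  // wraps modulo 2**SEQ_W
      discard = ~diff[SEQ_W-1];       // sign bit 0: entry_seq >= val_seq
    end
  endfunction

  initial begin
    $display("%b", discard(4'd5,  4'd3));   // 1: issued after val_seq
    $display("%b", discard(4'd3,  4'd3));   // 1: same instruction
    $display("%b", discard(4'd2,  4'd3));   // 0: older, keep
    $display("%b", discard(4'd1,  4'd14));  // 1: newer across the wrap
    $display("%b", discard(4'd14, 4'd1));   // 0: older across the wrap
  end
endmodule
```

Because the subtraction wraps modulo 2^SEQ_WIDTH, the sign-bit test gives the intended answer only while the two sequence numbers being compared are less than half the sequence space apart. The last two calls exercise the wrap-around cases.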

arbk.v

// Arbiter for K_bus

module arbk (req0, ack0, req1, ack1, req2, ack2, req3, ack3,
    halt_req, halt_ack, roll_req, roll_ack, reset);

  `include "parameter"

  input req0, req1, req2, req3, halt_req, roll_req, reset;
  output ack0, ack1, ack2, ack3, halt_ack, roll_ack;

  reg ack0, ack1, ack2, ack3, halt_ack, roll_ack;

  reg [CHKID_WIDTH-1:0] grant;
  reg found;

  always wait (reset)
  begin
    disable rollback_cycle;
    disable arb_cycle;
    ack0 = 0;
    ack1 = 0;
    ack2 = 0;
    ack3 = 0;
    halt_ack = 0;
    roll_ack = 0;
    grant = 0;
    wait (~reset);
  end

  always wait (halt_req & ~reset)
  begin :rollback_cycle
    #1;
    halt_ack = 1;
    wait (roll_req);
    #1;
    disable arb_cycle;
    ack0 = 0;
    ack1 = 0;
    ack2 = 0;
    ack3 = 0;
    grant = 0;

    roll_ack = 1;
    fork
      begin
        wait (~roll_req);
        #1;
        roll_ack = 0;
      end
      begin
        wait (~halt_req);
        #1;
        halt_ack = 0;
      end
    join
  end

  always wait ((req0 | req1 | req2 | req3) & ~halt_req & ~reset)
  begin :arb_cycle
    found = 0;
    while (~found)
    begin
      grant = grant + 1;              // scan request signals
      if (grant >= CHECKERS)
        grant = 0;
      case (grant)
        0: if (req0) found = 1;
        1: if (req1) found = 1;
        2: if (req2) found = 1;
        3: if (req3) found = 1;
      endcase
    end

    #1;
    case (grant)
      0: begin
           ack0 = 1;
           wait (~req0);
           #1;
           ack0 = 0;
         end
      1: begin
           ack1 = 1;
           wait (~req1);
           #1;
           ack1 = 0;
         end
      2: begin
           ack2 = 1;
           wait (~req2);
           #1;
           ack2 = 0;
         end
      3: begin
           ack3 = 1;
           wait (~req3);
           #1;
           ack3 = 0;
         end
    endcase
  end

endmodule // arbk
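The arb_cycle block in arbk grants the K_bus in round-robin order. It resumes the scan one position after the previous winner, wraps at CHECKERS, and stops at the first active request. As a result, a checker that has just been served has the lowest priority in the next round. The sketch below writes that scan as a pure function so that the fairness property can be checked in isolation.

The sketch makes these assumptions:
- CHECKERS = 4, one position per checker on the K_bus.
- The grant is 2 bits wide. CHKID_WIDTH is defined in "parameter", which is not shown.
- The module and function names are hypothetical.

Unlike the while loop in arbk, the function examines a fixed number of positions. When no request is active it returns the previous grant, a case arbk never reaches because arb_cycle starts only when at least one request is high.

```verilog
// Hypothetical sketch of the round-robin scan in arbk's arb_cycle.
module rr_scan_demo;
  parameter CHECKERS = 4;   // assumed: cki, ckf, ckm, ckr

  // Next grant after 'last', given request vector 'req' (bit i = req_i).
  // Positions last+1, last+2, ... are examined with wrap-around, so the
  // previous winner is considered last.
  function [1:0] next_grant;
    input [1:0] last;
    input [CHECKERS-1:0] req;
    integer i;
    reg found;
    reg [1:0] cand;
    begin
      next_grant = last;              // unchanged if nothing is requesting
      found = 0;
      for (i = 1; i <= CHECKERS; i = i + 1) begin
        cand = (last + i) % CHECKERS;
        if (!found && req[cand]) begin
          next_grant = cand;
          found = 1;
        end
      end
    end
  endfunction

  initial begin
    $display("%0d", next_grant(2'd0, 4'b1111));  // 1
    $display("%0d", next_grant(2'd3, 4'b1111));  // 0 (wraps)
    $display("%0d", next_grant(2'd1, 4'b0010));  // 1 (only requester)
    $display("%0d", next_grant(2'd2, 4'b1001));  // 3
  end
endmodule
```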

arbr.v

// Arbiter for R_bus

module arbr (req0, ack0, req1, ack1,
    halt_req, halt_ack, roll_req, roll_ack, reset);

  `include "parameter"

  input req0, req1, halt_req, roll_req, reset;
  output ack0, ack1, halt_ack, roll_ack;

`ifdef BEHAV_ARBR                     // behavioral level

  reg ack0, ack1, halt_ack, roll_ack;

  reg grant;

  always wait (reset)
  begin
    disable rollback_cycle;
    disable arb_cycle;
    ack0 = 0;
    ack1 = 0;
    halt_ack = 0;
    roll_ack = 0;
    grant = 0;
    wait (~reset);
  end

  always wait (halt_req & ~reset)
  begin :rollback_cycle
    #1;
    halt_ack = 1;
    wait (roll_req);
    #1;
    disable arb_cycle;
    ack0 = 0;
    ack1 = 0;
    grant = 0;

    roll_ack = 1;
    fork
      begin
        wait (~roll_req);
        #1;
        roll_ack = 0;
      end
      begin
        wait (~halt_req);
        #1;
        halt_ack = 0;
      end
    join
  end

  always wait ((req0 | req1) & ~halt_req & ~reset)
  begin :arb_cycle
    if (req0 & req1)
      grant = ~grant;                 // switch to other one
    else
      grant = req1;                   // 0 if not req1

    #1;
    if (grant)
    begin
      ack1 = 1;
      wait (~req1);
      #1;
      ack1 = 0;
    end
    else
    begin
      ack0 = 1;
      wait (~req0);
      #1;
      ack0 = 0;
    end
  end

`else                                 // gate level

  and #1 g1 (v0, req0, ~v1);
  and #2 g2 (v1, req1, ~v0);
  not #1 g3 (ack0, nack0);
  not #1 g4 (ack1, nack1);
  nmos #1 m1 (nack1, v0, v1);
  nmos #1 m2 (nack0, v1, v0);
  pullup (nack0), (nack1);

  buf #1 g5 (halt_ack, halt_req);
  and #1 g6 (roll_ack, roll_req, ~ack0, ~ack1);

`ifdef WAVE_ARBR
  initial
    #0 $gr_addwaves ("req0", req0, "v0", v0, "ack0", ack0,
                     "req1", req1, "v1", v1, "ack1", ack1);
`endif

`endif // BEHAV_ARBR

endmodule // arbr

bigc.v

// Big C-element: used for halt_ack and roll_ack

module bigc (in1, in2, in3, in4, in5, in6, in7, in8, in9, in10, in11, in12,
    in13, in14, out, reset);

  input in1, in2, in3, in4, in5, in6, in7, in8, in9, in10, in11, in12, in13,
      in14;
  output out;
  input reset;

`ifdef BEHAV_BIGC                     // behavioral level

  reg out;

  always wait (reset)
  begin
    out = 0;
    wait (~reset);
  end

  always @(in1 or in2 or in3 or in4 or in5 or in6 or in7 or in8 or in9 or
      in10 or in11 or in12 or in13 or in14)
    if (in1 & in2 & in3 & in4 & in5 & in6 & in7 & in8 & in9 & in10 &
        in11 & in12 & in13 & in14)
      out = 1;
    else if (~in1 & ~in2 & ~in3 & ~in4 & ~in5 & ~in6 & ~in7 & ~in8 &
        ~in9 & ~in10 & ~in11 & ~in12 & ~in13 & ~in14)
      out = 0;

`else                                 // gate level

  muller4 x1 (in1, in2, in3, in4, out1);
  muller4 x2 (in5, in6, in7, in8, out2);
  muller3 x3 (in9, in10, in11, out3);
  muller3 x4 (in12, in13, in14, out4);
  muller4 x5 (out1, out2, out3, out4, out);

`endif

endmodule // bigc
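bigc combines the fourteen halt_ack (or roll_ack) signals with Muller C-element behavior:
- The output rises only after every input is high.
- The output falls only after every input is low.
- Otherwise the output holds its previous value.

The gate-level branch builds this from a two-level tree of muller4 and muller3 cells, defined in gates.v, which is not listed in this section. The sketch below is a hypothetical 2-input behavioral C-element, a simplified version of the BEHAV_BIGC branch, with a short stimulus that exercises the hold case. The module names c2 and c2_demo are illustrative only.

```verilog
// Hypothetical 2-input Muller C-element, behavioral.
module c2 (a, b, out, reset);
  input a, b, reset;
  output out;
  reg out;

  always @(a or b or reset)
    if (reset)
      out = 0;
    else if (a & b)
      out = 1;          // all inputs high: rise
    else if (~a & ~b)
      out = 0;          // all inputs low: fall
    // inputs disagree: no assignment, so out holds its previous value
endmodule

module c2_demo;
  reg a, b, reset;
  wire out;

  c2 x (a, b, out, reset);

  initial begin
    reset = 1; a = 0; b = 0;
    #1 reset = 0;
    #1 a = 1;  #1 $display("a=1 b=0 -> %b", out);  // 0 (hold)
    #1 b = 1;  #1 $display("a=1 b=1 -> %b", out);  // 1
    #1 a = 0;  #1 $display("a=0 b=1 -> %b", out);  // 1 (hold)
    #1 b = 0;  #1 $display("a=0 b=0 -> %b", out);  // 0
  end
endmodule
```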

ckf.v

// CKF (combining CKFQ and CKFF)

module ckf (in_seq, in_data1, in_data2, in_req, in_ack, arb_req, arb_ack,
    tri_seq, tri_chkid, tri_error, tri_req, out_ack,
    val_seq, halt_req, halt_ack, roll_req, roll_ack, reset);

  `include "parameter"

  input [SEQ_WIDTH-1:0] in_seq, val_seq;
  input [DATA_WIDTH:0] in_data1, in_data2;
  output [SEQ_WIDTH-1:0] tri_seq;
  output [CHKID_WIDTH-1:0] tri_chkid;
  input in_req, arb_ack, out_ack, halt_req, roll_req, reset;
  output in_ack, arb_req, tri_error, tri_req, halt_ack, roll_ack;

  reg halt_ack, roll_ack;

  wire [SEQ_WIDTH-1:0] mid_seq;
  wire [DATA_WIDTH:0] mid_data1, mid_data2;

  ckfq ckfq (in_seq, in_data1, in_data2, in_req, in_ack,
      mid_seq, mid_data1, mid_data2, mid_req, mid_ack,
      val_seq, halt_req, haltq_ack, roll_req, rollq_ack, reset);
  ckff ckff (mid_seq, mid_data1, mid_data2, mid_req, mid_ack, arb_req, arb_ack,
      tri_seq, tri_chkid, tri_error, tri_req, out_ack,
      val_seq, halt_req, haltf_ack, roll_req, rollf_ack, reset);

  always wait (reset)
  begin
    halt_ack = 0;
    roll_ack = 0;
    wait (~reset);
  end

  always @(haltq_ack or haltf_ack)
    if (haltq_ack & haltf_ack)
      halt_ack = 1;
    else if (~haltq_ack & ~haltf_ack)
      halt_ack = 0;

  always @(rollq_ack or rollf_ack)
    if (rollq_ack & rollf_ack)
      roll_ack = 1;
    else if (~rollq_ack & ~rollf_ack)
      roll_ack = 0;

endmodule // ckf

// ====================================================================
// Checker for Register File (functional part)

module ckff (in_seq, in_data1, in_data2, in_req, in_ack, arb_req, arb_ack,
    tri_seq, tri_chkid, tri_error, tri_req, out_ack,
    val_seq, halt_req, halt_ack, roll_req, roll_ack, reset);

  `include "parameter"

  input [SEQ_WIDTH-1:0] in_seq, val_seq;
  input [DATA_WIDTH:0] in_data1, in_data2;
  output [SEQ_WIDTH-1:0] tri_seq;
  output [CHKID_WIDTH-1:0] tri_chkid;
  input in_req, arb_ack, out_ack, halt_req, roll_req, reset;
  output in_ack, arb_req, tri_error, tri_req, halt_ack, roll_ack;

  reg [SEQ_WIDTH-1:0] out_seq;
  reg in_ack, arb_req, out_error, out_req, halt_ack, roll_ack;

  reg [SEQ_WIDTH-1:0] diff;
  reg [DATA_WIDTH:0] data1_buf, data2_buf;
  reg [5:0] loop;
  reg valid, parity1, parity2;

  assign tri_seq = arb_ack ? out_seq : ZZZ;
  assign tri_chkid = arb_ack ? ID_CKF : ZZZ;
  assign tri_error = arb_ack ? out_error : ZZZ;
  assign tri_req = arb_ack ? out_req : ZZZ;

  always wait (reset)
  begin
    disable rollback_cycle;
    disable latch_cycle;
    disable output_cycle;
    out_seq = XXX;
    in_ack = 0;
    arb_req = 0;
    out_error = XXX;
    out_req = 0;
    halt_ack = 0;
    roll_ack = 0;
    valid = 0;
    wait (~reset);
  end

  always wait (halt_req & ~reset)
  begin :rollback_cycle
    #1;
    halt_ack = 1;
    wait (roll_req);
    #1;
    disable latch_cycle;
    disable output_cycle;
    in_ack = 0;
    arb_req = 0;
    out_req = 0;

    #DLY_SEQ_COMP;
    diff = out_seq - val_seq;         // compare sequence numbers
    if (~diff [SEQ_WIDTH-1])
    begin
      valid = 0;
      out_seq = XXX;
      out_error = XXX;
    end

    roll_ack = 1;
    fork
      begin
        wait (~roll_req);
        #1;
        roll_ack = 0;
      end
      begin
        wait (~halt_req);
        #1;
        halt_ack = 0;
      end
    join
  end

  always wait (in_req & ~valid & ~halt_req & ~reset)
  begin :latch_cycle
    #1;
    wait (~halt_req);
    out_seq = in_seq;
    data1_buf = in_data1;
    data2_buf = in_data2;
    in_ack = 1;
    valid = 1;
    wait (~in_req);
    #1;
    wait (~halt_req);
    in_ack = 0;
143: end144:145: always wait (valid & ˜halt_req & ˜reset)146: begin :output_cycle147: #1;148: parity1 = 0;149: parity2 = 0;150: for (loop=0; loop<1+DATA_WIDTH; loop=loop+1) // calculate parity151: begin152: parity1 = parity1 ˆ data1_buf [loop];153: parity2 = parity2 ˆ data2_buf [loop];154: end155: if (parity1 === 1’bx) parity1 = 1;156: if (parity2 === 1’bx) parity2 = 1;157:158: #DLY_CKF_CHK;159: out_error = parity1 | parity2;160: arb_req = 1;161: wait (arb_ack);162: #1;163: wait (˜halt_req);164: out_req = 1;165: wait (out_ack);166: #1;167: valid = 0;168: arb_req = 0;169: wait (˜halt_req);170: out_req = 0;171: wait (˜out_ack & ˜arb_ack);172: end173:174: endmodule // ckff175:176: // ====================================================================177: // Queue for Register File Checker178:179: module ckfq (in_seq, in_data1, in_data2, in_req, in_ack,180: out_seq, out_data1, out_data2, out_req, out_ack,181: val_seq, halt_req, halt_ack, roll_req, roll_ack, reset);182:183: ‘include "parameter"184:185: input [SEQ_WIDTH-1:0] in_seq, val_seq;186: input [DATA_WIDTH:0] in_data1, in_data2;187: output [SEQ_WIDTH-1:0] out_seq;188: output [DATA_WIDTH:0] out_data1, out_data2;189: input in_req, out_ack, halt_req, roll_req, reset;190: output in_ack, out_req, halt_ack, roll_ack;191:192: reg halt_ack, roll_ack;193:194: wire [SEQ_WIDTH-1:0] seq1;195: wire [DATA_WIDTH:0] data11, data21;196:197: ckf_buf buf1 (in_seq, in_data1, in_data2, in_req, in_ack,198: seq1, data11, data21, req1, ack1,199: val_seq, halt_req, halt_a1, roll_req, roll_a1, reset);200: ckf_buf buf2 (seq1, data11, data21, req1, ack1,201: out_seq, out_data1, out_data2, out_req, out_ack,202: val_seq, halt_req, halt_a2, roll_req, roll_a2, reset);203:204: always wait (reset)205: begin206: halt_ack = 0;207: roll_ack = 0;208: wait (˜reset);209: end210:211: always @(halt_a1 or halt_a2)212: if (halt_a1 & halt_a2)213: halt_ack = 1;
    else if (~halt_a1 & ~halt_a2)
      halt_ack = 0;

  always @(roll_a1 or roll_a2)
    if (roll_a1 & roll_a2)
      roll_ack = 1;
    else if (~roll_a1 & ~roll_a2)
      roll_ack = 0;

endmodule // ckfq

// ====================================================================

module ckf_buf (in_s, in_d1, in_d2, in_r, in_a,
                out_s, out_d1, out_d2, out_r, out_a,
                val_seq, halt_req, halt_ack, roll_req, roll_ack, reset);

  `include "parameter"

  input [SEQ_WIDTH-1:0] in_s, val_seq;
  input [DATA_WIDTH:0] in_d1, in_d2;
  output [SEQ_WIDTH-1:0] out_s;
  output [DATA_WIDTH:0] out_d1, out_d2;
  input in_r, out_a, halt_req, roll_req, reset;
  output in_a, out_r, halt_ack, roll_ack;

  reg [SEQ_WIDTH-1:0] out_s;
  reg [DATA_WIDTH:0] out_d1, out_d2;
  reg in_a, out_r, halt_ack, roll_ack;

  reg [SEQ_WIDTH-1:0] diff;
  reg valid;

  always wait (reset)
  begin
    disable rollback_cycle;
    disable input_cycle;
    disable output_cycle;
    out_s = XXX;
    out_d1 = XXX;
    out_d2 = XXX;
    in_a = 0;
    out_r = 0;
    halt_ack = 0;
    roll_ack = 0;
    valid = 0;
    wait (~reset);
  end

  always wait (halt_req & ~reset)
  begin :rollback_cycle
    #1;
    halt_ack = 1;
    wait (roll_req);
    #1;
    disable input_cycle;
    disable output_cycle;
    in_a = 0;
    out_r = 0;

    #DLY_SEQ_COMP;
    diff = out_s - val_seq; // compare sequence numbers
    if (~diff [SEQ_WIDTH-1])
    begin
      valid = 0;
      out_s = XXX;
      out_d1 = XXX;
      out_d2 = XXX;
    end

    roll_ack = 1;
    fork
      begin
        wait (~roll_req);
        #1;
        roll_ack = 0;
      end
      begin
        wait (~halt_req);
        #1;
        halt_ack = 0;
      end
    join
  end

  always wait (in_r & ~valid & ~halt_req & ~reset)
  begin :input_cycle
    #1;
    wait (~halt_req);
    out_s = in_s;
    out_d1 = in_d1;
    out_d2 = in_d2;
    in_a = 1;
    valid = 1;
    wait (~in_r);
    #1;
    wait (~halt_req);
    in_a = 0;
  end

  always wait (valid & ~halt_req & ~reset)
  begin :output_cycle
    #1;
    wait (~halt_req);
    out_r = 1;
    wait (out_a);
    #1;
    valid = 0;
    wait (~halt_req);
    out_r = 0;
    wait (~out_a);
  end

endmodule // ckf_buf
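
The output_cycle blocks of the checkers (ckff above, and likewise ckif, ckmf and ckrf) XOR every bit of a buffered word, parity bit included, and force an unknown result to 1, so a word is flagged when its overall parity is odd or when any bit is x or z. The sketch below restates that test as combinational logic using Verilog's reduction XOR, which yields x whenever an operand bit is x or z. It is an illustration only: the module names and the 8-bit width are assumptions and are not taken from the AMPIRE `parameter` file, where the width comes from DATA_WIDTH.

```verilog
// Hypothetical module, not part of the AMPIRE source: a combinational
// restatement of the parity test in the checkers' output_cycle blocks.
module parity_check_sketch #(parameter W = 8) (  // W stands in for DATA_WIDTH
  input  [W:0] word,   // bit W is the parity bit
  output       error   // 1 if overall parity is odd or any bit is x/z
);
  wire p = ^word;                          // reduction XOR over all W+1 bits
  assign error = (p === 1'bx) ? 1'b1 : p;  // an x/z bit makes p unknown
endmodule

// Small demonstration bench (also hypothetical).
module tb_parity_check_sketch;
  reg  [8:0] word;
  wire       error;
  parity_check_sketch #(.W(8)) dut (.word(word), .error(error));

  initial begin
    // Parity bit regenerated from the data bits, as dmem does on a retry read.
    word = {^8'hA5, 8'hA5};  #1 $display("%b", error); // 0: consistent word
    word = word ^ 9'h001;    #1 $display("%b", error); // 1: single-bit flip
    word = 9'b0_1010_x101;   #1 $display("%b", error); // 1: unknown bit
  end
endmodule
```

Using the reduction operator replaces the bit-serial `for` loop of the original; both give x for any unknown bit, which is why the explicit `=== 1'bx` test is kept.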

cki.v

// CKI (combining CKIQ and CKIF)

module cki (in_seq, in_data, in_req, in_ack, arb_req, arb_ack,
            tri_seq, tri_chkid, tri_error, tri_req, out_ack,
            val_seq, halt_req, halt_ack, roll_req, roll_ack, reset);

  `include "parameter"

  input [SEQ_WIDTH-1:0] in_seq, val_seq;
  input [DATA_WIDTH:0] in_data;
  output [SEQ_WIDTH-1:0] tri_seq;
  output [CHKID_WIDTH-1:0] tri_chkid;
  input in_req, arb_ack, out_ack, halt_req, roll_req, reset;
  output in_ack, arb_req, tri_error, tri_req, halt_ack, roll_ack;

  reg halt_ack, roll_ack;

  wire [SEQ_WIDTH-1:0] mid_seq;
  wire [DATA_WIDTH:0] mid_data;

  ckiq ckiq (in_seq, in_data, in_req, in_ack,
             mid_seq, mid_data, mid_req, mid_ack,
             val_seq, halt_req, haltq_ack, roll_req, rollq_ack, reset);
  ckif ckif (mid_seq, mid_data, mid_req, mid_ack, arb_req, arb_ack,
             tri_seq, tri_chkid, tri_error, tri_req, out_ack,
             val_seq, halt_req, haltf_ack, roll_req, rollf_ack, reset);

  always wait (reset)
  begin
    halt_ack = 0;
    roll_ack = 0;
    wait (~reset);
  end

  always @(haltq_ack or haltf_ack)
    if (haltq_ack & haltf_ack)
      halt_ack = 1;
    else if (~haltq_ack & ~haltf_ack)
      halt_ack = 0;

  always @(rollq_ack or rollf_ack)
    if (rollq_ack & rollf_ack)
      roll_ack = 1;
    else if (~rollq_ack & ~rollf_ack)
      roll_ack = 0;

endmodule // cki

// ====================================================================
// Checker for I_bus (functional part)

module ckif (in_seq, in_data, in_req, in_ack, arb_req, arb_ack,
             tri_seq, tri_chkid, tri_error, tri_req, out_ack,
             val_seq, halt_req, halt_ack, roll_req, roll_ack, reset);

  `include "parameter"

  input [SEQ_WIDTH-1:0] in_seq, val_seq;
  input [DATA_WIDTH:0] in_data;
  output [SEQ_WIDTH-1:0] tri_seq;
  output [CHKID_WIDTH-1:0] tri_chkid;
  input in_req, arb_ack, out_ack, halt_req, roll_req, reset;
  output in_ack, arb_req, tri_error, tri_req, halt_ack, roll_ack;

  reg [SEQ_WIDTH-1:0] out_seq;
  reg in_ack, arb_req, out_error, out_req, halt_ack, roll_ack;

  reg [SEQ_WIDTH-1:0] diff;
  reg [DATA_WIDTH:0] data_buf;
  reg [5:0] loop;
  reg valid, parity;

  assign tri_seq = arb_ack ? out_seq : ZZZ;
  assign tri_chkid = arb_ack ? ID_CKI : ZZZ;
  assign tri_error = arb_ack ? out_error : ZZZ;
  assign tri_req = arb_ack ? out_req : ZZZ;

  always wait (reset)
  begin
    disable rollback_cycle;
    disable latch_cycle;
    disable output_cycle;
    out_seq = XXX;
    in_ack = 0;
    arb_req = 0;
    out_error = XXX;
    out_req = 0;
    halt_ack = 0;
    roll_ack = 0;
    valid = 0;
    wait (~reset);
  end

  always wait (halt_req & ~reset)
  begin :rollback_cycle
    #1;
    halt_ack = 1;
    wait (roll_req);
    #1;
    disable latch_cycle;
    disable output_cycle;
    in_ack = 0;
    arb_req = 0;
    out_req = 0;

    #DLY_SEQ_COMP;
    diff = out_seq - val_seq; // compare sequence numbers
    if (~diff [SEQ_WIDTH-1])
    begin
      valid = 0;
      out_seq = XXX;
      out_error = XXX;
    end

    roll_ack = 1;
    fork
      begin
        wait (~roll_req);
        #1;
        roll_ack = 0;
      end
      begin
        wait (~halt_req);
        #1;
        halt_ack = 0;
      end
    join
  end

  always wait (in_req & ~valid & ~halt_req & ~reset)
  begin :latch_cycle
    #1;
    wait (~halt_req);
    out_seq = in_seq;
    data_buf = in_data;
    in_ack = 1;
    valid = 1;
    wait (~in_req);
    #1;
    wait (~halt_req);
    in_ack = 0;
  end

  always wait (valid & ~halt_req & ~reset)
  begin :output_cycle
    #1;
    parity = 0;
    for (loop=0; loop<1+DATA_WIDTH; loop=loop+1) // calculate parity
      parity = parity ^ data_buf [loop];
    if (parity === 1'bx) parity = 1;

    #DLY_CKI_CHK;
    out_error = parity;
    arb_req = 1;
    wait (arb_ack);
    #1;
    wait (~halt_req);
    out_req = 1;
    wait (out_ack);
    #1;
    valid = 0;
    arb_req = 0;
    wait (~halt_req);
    out_req = 0;
    wait (~out_ack & ~arb_ack);
  end

endmodule // ckif

// ====================================================================
// Queue for I_bus Checker

module ckiq (in_seq, in_data, in_req, in_ack,
             out_seq, out_data, out_req, out_ack,
             val_seq, halt_req, halt_ack, roll_req, roll_ack, reset);

  `include "parameter"

  input [SEQ_WIDTH-1:0] in_seq, val_seq;
  input [DATA_WIDTH:0] in_data;
  output [SEQ_WIDTH-1:0] out_seq;
  output [DATA_WIDTH:0] out_data;
  input in_req, out_ack, halt_req, roll_req, reset;
  output in_ack, out_req, halt_ack, roll_ack;

  reg halt_ack, roll_ack;

  wire [SEQ_WIDTH-1:0] seq1;
  wire [DATA_WIDTH:0] data1;

  cki_buf buf1 (in_seq, in_data, in_req, in_ack, seq1, data1, req1, ack1,
                val_seq, halt_req, halt_a1, roll_req, roll_a1, reset);
  cki_buf buf2 (seq1, data1, req1, ack1, out_seq, out_data, out_req, out_ack,
                val_seq, halt_req, halt_a2, roll_req, roll_a2, reset);

  always wait (reset)
  begin
    halt_ack = 0;
    roll_ack = 0;
    wait (~reset);
  end

  always @(halt_a1 or halt_a2)
    if (halt_a1 & halt_a2)
      halt_ack = 1;
    else if (~halt_a1 & ~halt_a2)
      halt_ack = 0;

  always @(roll_a1 or roll_a2)
    if (roll_a1 & roll_a2)
      roll_ack = 1;
    else if (~roll_a1 & ~roll_a2)
      roll_ack = 0;

endmodule // ckiq

// ====================================================================

module cki_buf (in_s, in_d, in_r, in_a, out_s, out_d, out_r, out_a,
                val_seq, halt_req, halt_ack, roll_req, roll_ack, reset);

  `include "parameter"

  input [SEQ_WIDTH-1:0] in_s, val_seq;
  input [DATA_WIDTH:0] in_d;
  output [SEQ_WIDTH-1:0] out_s;
  output [DATA_WIDTH:0] out_d;
  input in_r, out_a, halt_req, roll_req, reset;
  output in_a, out_r, halt_ack, roll_ack;

  reg [SEQ_WIDTH-1:0] out_s;
  reg [DATA_WIDTH:0] out_d;
  reg in_a, out_r, halt_ack, roll_ack;

  reg [SEQ_WIDTH-1:0] diff;
  reg valid;

  always wait (reset)
  begin
    disable rollback_cycle;
    disable input_cycle;
    disable output_cycle;
    out_s = XXX;
    out_d = XXX;
    in_a = 0;
    out_r = 0;
    halt_ack = 0;
    roll_ack = 0;
    valid = 0;
    wait (~reset);
  end

  always wait (halt_req & ~reset)
  begin :rollback_cycle
    #1;
    halt_ack = 1;
    wait (roll_req);
    #1;
    disable input_cycle;
    disable output_cycle;
    in_a = 0;
    out_r = 0;

    #DLY_SEQ_COMP;
    diff = out_s - val_seq; // compare sequence numbers
    if (~diff [SEQ_WIDTH-1])
    begin
      valid = 0;
      out_s = XXX;
      out_d = XXX;
    end

    roll_ack = 1;
    fork
      begin
        wait (~roll_req);
        #1;
        roll_ack = 0;
      end
      begin
        wait (~halt_req);
        #1;
        halt_ack = 0;
      end
    join
  end

  always wait (in_r & ~valid & ~halt_req & ~reset)
  begin :input_cycle
    #1;
    wait (~halt_req);
    out_s = in_s;
    out_d = in_d;
    in_a = 1;
    valid = 1;
    wait (~in_r);
    #1;
    wait (~halt_req);
    in_a = 0;
  end

  always wait (valid & ~halt_req & ~reset)
  begin :output_cycle
    #1;
    wait (~halt_req);
    out_r = 1;
    wait (out_a);
    #1;
    valid = 0;
    wait (~halt_req);
    out_r = 0;
    wait (~out_a);
  end

endmodule // cki_buf
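
Every rollback_cycle (in cki_buf and ckif above, and in the other checker and queue modules) decides whether to discard its entry by computing `diff = out_seq - val_seq` in SEQ_WIDTH bits and testing the most significant bit. Read as a two's-complement number, a clear MSB means the buffered sequence number is equal to or later than val_seq in modulo-2^SEQ_WIDTH order, so the entry is invalidated; a set MSB means it is older and survives. The test is only meaningful while all sequence numbers in flight lie within 2^(SEQ_WIDTH-1) of one another. The sketch below isolates that comparison; the module names and the 4-bit width are assumptions for illustration, not values from the AMPIRE source.

```verilog
// Hypothetical module, not part of the AMPIRE source: the sequence-number
// test from the rollback_cycle blocks, isolated as combinational logic.
module seq_squash_sketch #(parameter W = 4) (  // W stands in for SEQ_WIDTH
  input  [W-1:0] seq,      // sequence number held in the buffer (out_seq/out_s)
  input  [W-1:0] val_seq,  // val_seq input, as in the buffers above
  output         squash    // 1: entry would be discarded (valid = 0)
);
  wire [W-1:0] diff = seq - val_seq;  // modulo-2^W subtraction
  assign squash = ~diff[W-1];         // MSB clear: seq is at or after val_seq
endmodule

// Small demonstration bench (also hypothetical).
module tb_seq_squash_sketch;
  reg  [3:0] seq, val_seq;
  wire       squash;
  seq_squash_sketch #(.W(4)) dut (.seq(seq), .val_seq(val_seq), .squash(squash));

  initial begin
    seq = 4'd5; val_seq = 4'd3;  #1 $display("%b", squash); // 1: later, discarded
    seq = 4'd3; val_seq = 4'd3;  #1 $display("%b", squash); // 1: equal, discarded
    seq = 4'd2; val_seq = 4'd3;  #1 $display("%b", squash); // 0: older, kept
    seq = 4'd1; val_seq = 4'd14; #1 $display("%b", squash); // 1: later across wrap-around
  end
endmodule
```

Because the subtraction wraps modulo 2^W, sequence number 1 is correctly treated as later than 14 (1 - 14 = 3 in 4 bits), so no separate wrap-around logic is needed.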

ckm.v

// CKM (combining CKMQ and CKMF)

module ckm (in_seq, in_addr, in_data, in_req, in_ack, arb_req, arb_ack,
            tri_seq, tri_chkid, tri_error, tri_req, out_ack,
            val_seq, halt_req, halt_ack, roll_req, roll_ack, reset);

  `include "parameter"

  input [SEQ_WIDTH-1:0] in_seq, val_seq;
  input [DATA_WIDTH:0] in_addr, in_data;
  output [SEQ_WIDTH-1:0] tri_seq;
  output [CHKID_WIDTH-1:0] tri_chkid;
  input in_req, arb_ack, out_ack, halt_req, roll_req, reset;
  output in_ack, arb_req, tri_error, tri_req, halt_ack, roll_ack;

  reg halt_ack, roll_ack;

  wire [ADDR_WIDTH:0] adpt_addr = in_addr; // addr "adaptor"
  wire [SEQ_WIDTH-1:0] mid_seq;
  wire [ADDR_WIDTH:0] mid_addr;
  wire [DATA_WIDTH:0] mid_data;

  ckmq ckmq (in_seq, adpt_addr, in_data, in_req, in_ack,
             mid_seq, mid_addr, mid_data, mid_req, mid_ack,
             val_seq, halt_req, haltq_ack, roll_req, rollq_ack, reset);
  ckmf ckmf (mid_seq, mid_addr, mid_data, mid_req, mid_ack, arb_req, arb_ack,
             tri_seq, tri_chkid, tri_error, tri_req, out_ack,
             val_seq, halt_req, haltf_ack, roll_req, rollf_ack, reset);

  always wait (reset)
  begin
    halt_ack = 0;
    roll_ack = 0;
    wait (~reset);
  end

  always @(haltq_ack or haltf_ack)
    if (haltq_ack & haltf_ack)
      halt_ack = 1;
    else if (~haltq_ack & ~haltf_ack)
      halt_ack = 0;

  always @(rollq_ack or rollf_ack)
    if (rollq_ack & rollf_ack)
      roll_ack = 1;
    else if (~rollq_ack & ~rollf_ack)
      roll_ack = 0;

endmodule // ckm

// ====================================================================
// Checker for Memory Operations (functional part)

module ckmf (in_seq, in_addr, in_data, in_req, in_ack, arb_req, arb_ack,
             tri_seq, tri_chkid, tri_error, tri_req, out_ack,
             val_seq, halt_req, halt_ack, roll_req, roll_ack, reset);

  `include "parameter"

  input [SEQ_WIDTH-1:0] in_seq, val_seq;
  input [ADDR_WIDTH:0] in_addr;
  input [DATA_WIDTH:0] in_data;
  output [SEQ_WIDTH-1:0] tri_seq;
  output [CHKID_WIDTH-1:0] tri_chkid;
  input in_req, arb_ack, out_ack, halt_req, roll_req, reset;
  output in_ack, arb_req, tri_error, tri_req, halt_ack, roll_ack;

  reg [SEQ_WIDTH-1:0] out_seq;
  reg in_ack, arb_req, out_error, out_req, halt_ack, roll_ack;

  reg [SEQ_WIDTH-1:0] diff;
  reg [ADDR_WIDTH:0] addr_buf;
  reg [DATA_WIDTH:0] data_buf;
  reg [5:0] loop;
  reg valid, parity1, parity2;

  assign tri_seq = arb_ack ? out_seq : ZZZ;
  assign tri_chkid = arb_ack ? ID_CKM : ZZZ;
  assign tri_error = arb_ack ? out_error : ZZZ;
  assign tri_req = arb_ack ? out_req : ZZZ;

  always wait (reset)
  begin
    disable rollback_cycle;
    disable latch_cycle;
    disable output_cycle;
    out_seq = XXX;
    in_ack = 0;
    arb_req = 0;
    out_error = XXX;
    out_req = 0;
    halt_ack = 0;
    roll_ack = 0;
    valid = 0;
    wait (~reset);
  end

  always wait (halt_req & ~reset)
  begin :rollback_cycle
    #1;
    halt_ack = 1;
    wait (roll_req);
    #1;
    disable latch_cycle;
    disable output_cycle;
    in_ack = 0;
    arb_req = 0;
    out_req = 0;

    #DLY_SEQ_COMP;
    diff = out_seq - val_seq; // compare sequence numbers
    if (~diff [SEQ_WIDTH-1])
    begin
      valid = 0;
      out_seq = XXX;
      out_error = XXX;
    end

    roll_ack = 1;
    fork
      begin
        wait (~roll_req);
        #1;
        roll_ack = 0;
      end
      begin
        wait (~halt_req);
        #1;
        halt_ack = 0;
      end
    join
  end

  always wait (in_req & ~valid & ~halt_req & ~reset)
  begin :latch_cycle
    #1;
    wait (~halt_req);
    out_seq = in_seq;
    addr_buf = in_addr;
    data_buf = in_data;
    in_ack = 1;
    valid = 1;
    wait (~in_req);
    #1;
    wait (~halt_req);
    in_ack = 0;
  end

  always wait (valid & ~halt_req & ~reset)
  begin :output_cycle
    #1;
    parity1 = 0;
    for (loop=0; loop<1+ADDR_WIDTH; loop=loop+1) // calculate parity
      parity1 = parity1 ^ addr_buf [loop];
    if (parity1 === 1'bx) parity1 = 1;
    parity2 = 0;
    for (loop=0; loop<1+DATA_WIDTH; loop=loop+1) // calculate parity
      parity2 = parity2 ^ data_buf [loop];
    if (parity2 === 1'bx) parity2 = 1;

    #DLY_CKM_CHK;
    out_error = parity1 | parity2;
    arb_req = 1;
    wait (arb_ack);
    #1;
    wait (~halt_req);
    out_req = 1;
    wait (out_ack);
    #1;
    valid = 0;
    arb_req = 0;
    wait (~halt_req);
    out_req = 0;
    wait (~out_ack & ~arb_ack);
  end

endmodule // ckmf

// ====================================================================
// Queue for Memory Operations Checker

module ckmq (in_seq, in_addr, in_data, in_req, in_ack,
             out_seq, out_addr, out_data, out_req, out_ack,
             val_seq, halt_req, halt_ack, roll_req, roll_ack, reset);

  `include "parameter"

  input [SEQ_WIDTH-1:0] in_seq, val_seq;
  input [ADDR_WIDTH:0] in_addr;
  input [DATA_WIDTH:0] in_data;
  output [SEQ_WIDTH-1:0] out_seq;
  output [ADDR_WIDTH:0] out_addr;
  output [DATA_WIDTH:0] out_data;
  input in_req, out_ack, halt_req, roll_req, reset;
  output in_ack, out_req, halt_ack, roll_ack;

  reg halt_ack, roll_ack;

  wire [SEQ_WIDTH-1:0] seq1;
  wire [ADDR_WIDTH:0] addr1;
  wire [DATA_WIDTH:0] data1;

  ckm_buf buf1 (in_seq, in_addr, in_data, in_req, in_ack,
                seq1, addr1, data1, req1, ack1,
                val_seq, halt_req, halt_a1, roll_req, roll_a1, reset);
  ckm_buf buf2 (seq1, addr1, data1, req1, ack1,
                out_seq, out_addr, out_data, out_req, out_ack,
                val_seq, halt_req, halt_a2, roll_req, roll_a2, reset);

  always wait (reset)
  begin
    halt_ack = 0;
    roll_ack = 0;
    wait (~reset);
  end

  always @(halt_a1 or halt_a2)
    if (halt_a1 & halt_a2)
      halt_ack = 1;
    else if (~halt_a1 & ~halt_a2)
      halt_ack = 0;

  always @(roll_a1 or roll_a2)
    if (roll_a1 & roll_a2)
      roll_ack = 1;
    else if (~roll_a1 & ~roll_a2)
      roll_ack = 0;

endmodule // ckmq

// ====================================================================

module ckm_buf (in_s, in_d1, in_d2, in_r, in_a,
                out_s, out_d1, out_d2, out_r, out_a,
                val_seq, halt_req, halt_ack, roll_req, roll_ack, reset);

  `include "parameter"

  input [SEQ_WIDTH-1:0] in_s, val_seq;
  input [ADDR_WIDTH:0] in_d1;
  input [DATA_WIDTH:0] in_d2;
  output [SEQ_WIDTH-1:0] out_s;
  output [ADDR_WIDTH:0] out_d1;
  output [DATA_WIDTH:0] out_d2;
  input in_r, out_a, halt_req, roll_req, reset;
  output in_a, out_r, halt_ack, roll_ack;

  reg [SEQ_WIDTH-1:0] out_s;
  reg [ADDR_WIDTH:0] out_d1;
  reg [DATA_WIDTH:0] out_d2;
  reg in_a, out_r, halt_ack, roll_ack;

  reg [SEQ_WIDTH-1:0] diff;
  reg valid;

  always wait (reset)
  begin
    disable rollback_cycle;
    disable input_cycle;
    disable output_cycle;
    out_s = XXX;
    out_d1 = XXX;
    out_d2 = XXX;
    in_a = 0;
    out_r = 0;
    halt_ack = 0;
    roll_ack = 0;
    valid = 0;
    wait (~reset);
  end

  always wait (halt_req & ~reset)
  begin :rollback_cycle
    #1;
    halt_ack = 1;
    wait (roll_req);
    #1;
    disable input_cycle;
    disable output_cycle;
    in_a = 0;
    out_r = 0;

    #DLY_SEQ_COMP;
    diff = out_s - val_seq; // compare sequence numbers
    if (~diff [SEQ_WIDTH-1])
    begin
      valid = 0;
      out_s = XXX;
      out_d1 = XXX;
      out_d2 = XXX;
    end

    roll_ack = 1;
    fork
      begin
        wait (~roll_req);
        #1;
        roll_ack = 0;
      end
      begin
        wait (~halt_req);
        #1;
        halt_ack = 0;
      end
    join
  end

  always wait (in_r & ~valid & ~halt_req & ~reset)
  begin :input_cycle
    #1;
    wait (~halt_req);
    out_s = in_s;
    out_d1 = in_d1;
    out_d2 = in_d2;
    in_a = 1;
    valid = 1;
    wait (~in_r);
    #1;
    wait (~halt_req);
    in_a = 0;
  end

  always wait (valid & ~halt_req & ~reset)
  begin :output_cycle
    #1;
    wait (~halt_req);
    out_r = 1;
    wait (out_a);
    #1;
    valid = 0;
    wait (~halt_req);
    out_r = 0;
    wait (~out_a);
  end

endmodule // ckm_buf
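
Each queue (ckfq, ckiq, ckmq) and each combining module (cki, ckm) merges the halt and roll acknowledgments of its two halves with always blocks that raise the combined acknowledgment only when both inputs are high, lower it only when both are low, and otherwise leave it unchanged. That is the behaviour of a two-input Muller C-element, so the combined acknowledgment completes only after both halves have finished. The sketch below isolates that behaviour with a reset input; the module names and ports are assumptions for illustration and are not meant to describe the `muller2c` cell instantiated by the gate-level dq_buffer.

```verilog
// Hypothetical module, not part of the AMPIRE source: a behavioural
// two-input C-element in the style of the halt_ack/roll_ack blocks above.
module c_element_sketch (
  input      a, b,
  input      reset,
  output reg y
);
  always @(a or b or reset)
    if (reset)
      y = 1'b0;
    else if (a & b)
      y = 1'b1;       // both inputs high: output rises
    else if (~a & ~b)
      y = 1'b0;       // both inputs low: output falls
                      // inputs disagree: y keeps its previous value
endmodule

// Small demonstration bench (also hypothetical).
module tb_c_element_sketch;
  reg  a, b, reset;
  wire y;
  c_element_sketch dut (.a(a), .b(b), .reset(reset), .y(y));

  initial begin
    reset = 1; a = 0; b = 0;
    #1 reset = 0;                      // both inputs low: y = 0
    #1 a = 1; #1 $display("%b", y);    // 0: only one half has acknowledged
    #1 b = 1; #1 $display("%b", y);    // 1: both halves acknowledged
    #1 a = 0; #1 $display("%b", y);    // 1: held until both release
    #1 b = 0; #1 $display("%b", y);    // 0: both released
  end
endmodule
```

The hold case is what makes the merged signal safe for a four-phase handshake: a fast half cannot lower the combined acknowledgment while the slow half is still asserting its own.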

ckr.v

// CKR (combining CKRQ and CKRF)

module ckr (in_seq, in_data, in_req, in_ack, arb_req, arb_ack,
            tri_seq, tri_chkid, tri_error, tri_req, out_ack,
            val_seq, halt_req, halt_ack, roll_req, roll_ack, reset);

  `include "parameter"

  input [SEQ_WIDTH-1:0] in_seq, val_seq;
  input [DATA_WIDTH:0] in_data;
  output [SEQ_WIDTH-1:0] tri_seq;
  output [CHKID_WIDTH-1:0] tri_chkid;
  input in_req, arb_ack, out_ack, halt_req, roll_req, reset;
  output in_ack, arb_req, tri_error, tri_req, halt_ack, roll_ack;

  reg halt_ack, roll_ack;

  wire [SEQ_WIDTH-1:0] mid_seq;
  wire [DATA_WIDTH:0] mid_data;

  ckrq ckrq (in_seq, in_data, in_req, in_ack,
             mid_seq, mid_data, mid_req, mid_ack,
             val_seq, halt_req, haltq_ack, roll_req, rollq_ack, reset);
  ckrf ckrf (mid_seq, mid_data, mid_req, mid_ack, arb_req, arb_ack,
             tri_seq, tri_chkid, tri_error, tri_req, out_ack,
             val_seq, halt_req, haltf_ack, roll_req, rollf_ack, reset);

  always wait (reset)
  begin
    halt_ack = 0;
    roll_ack = 0;
    wait (~reset);
  end

  always @(haltq_ack or haltf_ack)
    if (haltq_ack & haltf_ack)
      halt_ack = 1;
    else if (~haltq_ack & ~haltf_ack)
      halt_ack = 0;

  always @(rollq_ack or rollf_ack)
    if (rollq_ack & rollf_ack)
      roll_ack = 1;
    else if (~rollq_ack & ~rollf_ack)
      roll_ack = 0;

endmodule // ckr

// ====================================================================
// Checker for R_bus (functional part)

module ckrf (in_seq, in_data, in_req, in_ack, arb_req, arb_ack,
             tri_seq, tri_chkid, tri_error, tri_req, out_ack,
             val_seq, halt_req, halt_ack, roll_req, roll_ack, reset);

  `include "parameter"

  input [SEQ_WIDTH-1:0] in_seq, val_seq;
  input [DATA_WIDTH:0] in_data;
  output [SEQ_WIDTH-1:0] tri_seq;
  output [CHKID_WIDTH-1:0] tri_chkid;
  input in_req, arb_ack, out_ack, halt_req, roll_req, reset;
  output in_ack, arb_req, tri_error, tri_req, halt_ack, roll_ack;

  reg [SEQ_WIDTH-1:0] out_seq;
  reg in_ack, arb_req, out_error, out_req, halt_ack, roll_ack;

  reg [SEQ_WIDTH-1:0] diff;
  reg [DATA_WIDTH:0] data_buf;
  reg [5:0] loop;
  reg valid, parity;

  assign tri_seq = arb_ack ? out_seq : ZZZ;
  assign tri_chkid = arb_ack ? ID_CKR : ZZZ;
  assign tri_error = arb_ack ? out_error : ZZZ;
  assign tri_req = arb_ack ? out_req : ZZZ;

  always wait (reset)
  begin
    disable rollback_cycle;
    disable latch_cycle;
    disable output_cycle;
    out_seq = XXX;
    in_ack = 0;
    arb_req = 0;
    out_error = XXX;
    out_req = 0;
    halt_ack = 0;
    roll_ack = 0;
    valid = 0;
    wait (~reset);
  end

  always wait (halt_req & ~reset)
  begin :rollback_cycle
    #1;
    halt_ack = 1;
    wait (roll_req);
    #1;
    disable latch_cycle;
    disable output_cycle;
    in_ack = 0;
    arb_req = 0;
    out_req = 0;

    #DLY_SEQ_COMP;
    diff = out_seq - val_seq; // compare sequence numbers
    if (~diff [SEQ_WIDTH-1])
    begin
      valid = 0;
      out_seq = XXX;
      out_error = XXX;
    end

    roll_ack = 1;
    fork
      begin
        wait (~roll_req);
        #1;
        roll_ack = 0;
      end
      begin
        wait (~halt_req);
        #1;
        halt_ack = 0;
      end
    join
  end

  always wait (in_req & ~valid & ~halt_req & ~reset)
  begin :latch_cycle
    #1;
    wait (~halt_req);
    out_seq = in_seq;
    data_buf = in_data;
    in_ack = 1;
    valid = 1;
    wait (~in_req);
    #1;
    wait (~halt_req);
    in_ack = 0;
  end

  always wait (valid & ~halt_req & ~reset)
  begin :output_cycle
    #1;
    parity = 0;
    for (loop=0; loop<1+DATA_WIDTH; loop=loop+1) // calculate parity
      parity = parity ^ data_buf [loop];
    if (parity === 1'bx) parity = 1;

    #DLY_CKR_CHK;
    out_error = parity;
    arb_req = 1;
    wait (arb_ack);
    #1;
    wait (~halt_req);
    out_req = 1;
    wait (out_ack);
    #1;
    valid = 0;
    arb_req = 0;
    wait (~halt_req);
    out_req = 0;
    wait (~out_ack & ~arb_ack);
  end

endmodule // ckrf

// ====================================================================
// Queue for R_bus Checker

module ckrq (in_seq, in_data, in_req, in_ack,
             out_seq, out_data, out_req, out_ack,
             val_seq, halt_req, halt_ack, roll_req, roll_ack, reset);

  `include "parameter"

  input [SEQ_WIDTH-1:0] in_seq, val_seq;
  input [DATA_WIDTH:0] in_data;
  output [SEQ_WIDTH-1:0] out_seq;
  output [DATA_WIDTH:0] out_data;
  input in_req, out_ack, halt_req, roll_req, reset;
  output in_ack, out_req, halt_ack, roll_ack;

  reg halt_ack, roll_ack;

  wire [SEQ_WIDTH-1:0] seq1;
  wire [DATA_WIDTH:0] data1;

  ckr_buf buf1 (in_seq, in_data, in_req, in_ack, seq1, data1, req1, ack1,
                val_seq, halt_req, halt_a1, roll_req, roll_a1, reset);
  ckr_buf buf2 (seq1, data1, req1, ack1, out_seq, out_data, out_req, out_ack,
                val_seq, halt_req, halt_a2, roll_req, roll_a2, reset);

  always wait (reset)
  begin
    halt_ack = 0;
    roll_ack = 0;
    wait (~reset);
  end

  always @(halt_a1 or halt_a2)
    if (halt_a1 & halt_a2)
      halt_ack = 1;
    else if (~halt_a1 & ~halt_a2)
      halt_ack = 0;

  always @(roll_a1 or roll_a2)
    if (roll_a1 & roll_a2)
      roll_ack = 1;
    else if (~roll_a1 & ~roll_a2)
      roll_ack = 0;

endmodule // ckrq

// ====================================================================

module ckr_buf (in_s, in_d, in_r, in_a, out_s, out_d, out_r, out_a,
                val_seq, halt_req, halt_ack, roll_req, roll_ack, reset);

  `include "parameter"

  input [SEQ_WIDTH-1:0] in_s, val_seq;
  input [DATA_WIDTH:0] in_d;
  output [SEQ_WIDTH-1:0] out_s;
  output [DATA_WIDTH:0] out_d;
  input in_r, out_a, halt_req, roll_req, reset;
  output in_a, out_r, halt_ack, roll_ack;

  reg [SEQ_WIDTH-1:0] out_s;
  reg [DATA_WIDTH:0] out_d;
  reg in_a, out_r, halt_ack, roll_ack;

  reg [SEQ_WIDTH-1:0] diff;
  reg valid;

  always wait (reset)
  begin
    disable rollback_cycle;
    disable input_cycle;
    disable output_cycle;
    out_s = XXX;
    out_d = XXX;
    in_a = 0;
    out_r = 0;
    halt_ack = 0;
    roll_ack = 0;
    valid = 0;
    wait (~reset);
  end

  always wait (halt_req & ~reset)
  begin :rollback_cycle
    #1;
    halt_ack = 1;
    wait (roll_req);
    #1;
    disable input_cycle;
    disable output_cycle;
    in_a = 0;
    out_r = 0;

    #DLY_SEQ_COMP;
    diff = out_s - val_seq; // compare sequence numbers
    if (~diff [SEQ_WIDTH-1])
    begin
      valid = 0;
      out_s = XXX;
      out_d = XXX;
    end

    roll_ack = 1;
    fork
      begin
        wait (~roll_req);
        #1;
        roll_ack = 0;
      end
      begin
        wait (~halt_req);
        #1;
        halt_ack = 0;
      end
    join
  end

  always wait (in_r & ~valid & ~halt_req & ~reset)
  begin :input_cycle
    #1;
    wait (~halt_req);
    out_s = in_s;
    out_d = in_d;
    in_a = 1;
    valid = 1;
    wait (~in_r);
    #1;
    wait (~halt_req);
    in_a = 0;
  end

  always wait (valid & ~halt_req & ~reset)
  begin :output_cycle
    #1;
    wait (~halt_req);
    out_r = 1;
    wait (out_a);
    #1;
    valid = 0;
    wait (~halt_req);
    out_r = 0;
    wait (~out_a);
  end

endmodule // ckr_buf

dmem.v

// Data Memory

module dmem (addr, data, rw_mode, retry, in_req, in_ack,
             tri_out_req, out_ack,
             halt_req, halt_ack, roll_req, roll_ack, reset);

  `include "parameter"

  input [ADDR_WIDTH:0] addr;
  inout [DATA_WIDTH:0] data;
  input rw_mode, retry, in_req, out_ack, halt_req, roll_req, reset;
  output in_ack, tri_out_req, halt_ack, roll_ack;

  reg in_ack, tri_out_req, halt_ack, roll_ack;

  reg [DATA_WIDTH:0] data_out, dmemory [0:DMEM_SIZE-1];
  reg [ADDR_WIDTH:0] loop; // one extra bit for loop termination
  reg [5:0] loop1; // DATA_WIDTH = 32 bits max
  reg parity;

  assign data = (rw_mode & in_req) ? data_out : ZZZ;

  always wait (reset)
  begin
    disable rollback_cycle;
    disable memory_cycle;
    data_out = XXX;
    in_ack = 0;
    tri_out_req = ZZZ;
    halt_ack = 0;
    roll_ack = 0;
    for (loop = 0; loop < DMEM_SIZE; loop = loop + 1)
      dmemory [loop] = XXX;
    $readmemh ("inst.hex", dmemory, 0);
    wait (~reset);
  end

  always wait (halt_req & ~reset)
  begin :rollback_cycle
    #1;
    halt_ack = 1;
    wait (roll_req);
    #1;
    disable memory_cycle;
    data_out = XXX;
    in_ack = 0;
    tri_out_req = ZZZ;

    roll_ack = 1;
    fork
      begin
        wait (~roll_req);
        #1;
        roll_ack = 0;
      end
      begin
        wait (~halt_req);
        #1;
        halt_ack = 0;
      end
    join
  end

  // Input is not acknowledged until output is acknowledged so that
  // register and sequence number can be routed from DWB to DQ without
  // going through DMEM.

  always wait (in_req & ~halt_req & ~reset)
  begin :memory_cycle
    if (rw_mode) // 1 for read
    begin
      #DLY_DMEM_RD;
      data_out = dmemory [addr[ADDR_WIDTH-1:ADDR_IGNORE]];

      if (retry) // correct parity error
      begin
        parity = 0;
        for (loop1=0; loop1<DATA_WIDTH; loop1=loop1+1)
          parity = parity ^ data_out [loop1];
        data_out [DATA_WIDTH] = parity;
        dmemory [addr[ADDR_WIDTH-1:ADDR_IGNORE]] = data_out;
      end

      #1;
      wait (~halt_req);
      tri_out_req = 1;
      wait (out_ack);
      #1;
      in_ack = 1;
      wait (~halt_req);
      tri_out_req = ZZZ; // tri-state it
      wait (~in_req);
      #1;
      in_ack = 0;
      wait (~out_ack);
    end
    else // 0 for write
    begin
      #DLY_DMEM_WR;
      wait (~halt_req);
      dmemory [addr[ADDR_WIDTH-1:ADDR_IGNORE]] = data;
      in_ack = 1;
      wait (~in_req);
      #1;
      wait (~halt_req);
      in_ack = 0;
    end
  end

endmodule // dmem

dq.v

// Data Queue (combining DQQ and DQBUS)

module dq (in_seq, in_reg, in_data, in_req, in_ack, arb_req, arb_ack,
           tri_seq, tri_reg, tri_data, tri_req, out_ack,
           val_seq, halt_req, halt_ack, roll_req, roll_ack, reset);

  `include "parameter"

  input [SEQ_WIDTH-1:0] in_seq, val_seq;
  input [REG_WIDTH-1:0] in_reg;
  input [DATA_WIDTH:0] in_data;
  output [SEQ_WIDTH-1:0] tri_seq;
  output [REG_WIDTH-1:0] tri_reg;
  output [DATA_WIDTH:0] tri_data;
  input in_req, arb_ack, out_ack, halt_req, roll_req, reset;
  output in_ack, arb_req, tri_req, halt_ack, roll_ack;

  reg halt_ack, roll_ack;

  wire [SEQ_WIDTH-1:0] mid_seq;
  wire [REG_WIDTH-1:0] mid_reg;
  wire [DATA_WIDTH:0] mid_data;

  dqq dqq (in_seq, in_reg, in_data, in_req, in_ack,
           mid_seq, mid_reg, mid_data, mid_req, mid_ack,
           val_seq, halt_req, haltq_ack, roll_req, rollq_ack, reset);
  dqbus dqbus (mid_seq, mid_reg, mid_data, mid_req, mid_ack, arb_req, arb_ack,
               tri_seq, tri_reg, tri_data, tri_req, out_ack,
               val_seq, halt_req, haltb_ack, roll_req, rollb_ack, reset);

  always wait (reset)
  begin
    halt_ack = 0;
    roll_ack = 0;
    wait (~reset);
  end

  always @(haltq_ack or haltb_ack)
    if (haltq_ack & haltb_ack)
      halt_ack = 1;
    else if (~haltq_ack & ~haltb_ack)
      halt_ack = 0;

  always @(rollq_ack or rollb_ack)
    if (rollq_ack & rollb_ack)
      roll_ack = 1;
    else if (~rollq_ack & ~rollb_ack)
      roll_ack = 0;

endmodule // dq

// ====================================================================
// Data Queue Bus Handler

module dqbus (in_seq, in_reg, in_data, in_req, in_ack, arb_req, arb_ack,
              tri_seq, tri_reg, tri_data, tri_req, out_ack,
              val_seq, halt_req, halt_ack, roll_req, roll_ack, reset);

  `include "parameter"

  input [SEQ_WIDTH-1:0] in_seq, val_seq;
  input [REG_WIDTH-1:0] in_reg;
  input [DATA_WIDTH:0] in_data;
  output [SEQ_WIDTH-1:0] tri_seq;
  output [REG_WIDTH-1:0] tri_reg;
  output [DATA_WIDTH:0] tri_data;
  input in_req, arb_ack, out_ack, halt_req, roll_req, reset;
  output in_ack, arb_req, tri_req, halt_ack, roll_ack;

  reg [SEQ_WIDTH-1:0] out_seq;
  reg [REG_WIDTH-1:0] out_reg;
  reg [DATA_WIDTH:0] out_data;
  reg in_ack, arb_req, out_req, halt_ack, roll_ack;

  reg [SEQ_WIDTH-1:0] diff;
  reg valid;

  assign tri_seq = arb_ack ? out_seq : ZZZ;
  assign tri_reg = arb_ack ? out_reg : ZZZ;
  assign tri_data = arb_ack ? out_data : ZZZ;
  assign tri_req = arb_ack ? out_req : ZZZ;

  always wait (reset)
  begin
    disable rollback_cycle;
    disable latch_cycle;
    disable output_cycle;
    out_seq = XXX;
    out_reg = XXX;
    out_data = XXX;
    in_ack = 0;
    arb_req = 0;
    out_req = 0;
    halt_ack = 0;
    roll_ack = 0;
    valid = 0;
    wait (~reset);
  end

  always wait (halt_req & ~reset)
  begin :rollback_cycle
    #1;
    halt_ack = 1;
    wait (roll_req);
    #1;
    disable latch_cycle;
    disable output_cycle;
    in_ack = 0;
    arb_req = 0;
    out_req = 0;

    #DLY_SEQ_COMP;
    diff = out_seq - val_seq; // compare sequence numbers
    if (~diff [SEQ_WIDTH-1])
    begin
      valid = 0;
      out_seq = XXX;
      out_reg = XXX;
      out_data = XXX;
    end

    roll_ack = 1;
    fork
      begin
        wait (~roll_req);
        #1;
        roll_ack = 0;
      end
      begin
        wait (~halt_req);
        #1;
        halt_ack = 0;
      end
    join
  end

  always wait (in_req & ~valid & ~halt_req & ~reset)
  begin :latch_cycle
    #1;
    wait (~halt_req);
    out_seq = in_seq;
    out_reg = in_reg;
    out_data = in_data;
    in_ack = 1;
    valid = 1;
    wait (~in_req);
    #1;
    wait (~halt_req);
    in_ack = 0;
  end

  always wait (valid & ~halt_req & ~reset)
  begin :output_cycle
    #1;
    arb_req = 1;
    wait (arb_ack);
    #1;
    wait (~halt_req);
    out_req = 1;
    wait (out_ack);
    #1;
    valid = 0;
    arb_req = 0;
    wait (~halt_req);
    out_req = 0;
    wait (~out_ack & ~arb_ack);
  end

endmodule // dqbus

// ====================================================================
// Data Queue (real queue)

module dqq (in_seq, in_reg, in_data, in_req, in_ack,
            out_seq, out_reg, out_data, out_req, out_ack,
            val_seq, halt_req, halt_ack, roll_req, roll_ack, reset);

  `include "parameter"

  input [SEQ_WIDTH-1:0] in_seq, val_seq;
  input [REG_WIDTH-1:0] in_reg;
  input [DATA_WIDTH:0] in_data;
  output [SEQ_WIDTH-1:0] out_seq;
  output [REG_WIDTH-1:0] out_reg;
  output [DATA_WIDTH:0] out_data;
  input in_req, out_ack, halt_req, roll_req, reset;
  output in_ack, out_req, halt_ack, roll_ack;

  reg halt_ack, roll_ack;

  wire [SEQ_WIDTH-1:0] seq1;
  wire [REG_WIDTH-1:0] reg1;
  wire [DATA_WIDTH:0] data1;

  dq_buffer buf1 (in_seq, in_reg, in_data, in_req, in_ack,
                  seq1, reg1, data1, req1, ack1,
                  val_seq, halt_req, halt_a1, roll_req, roll_a1, reset);
  dq_buffer buf2 (seq1, reg1, data1, req1, ack1,
                  out_seq, out_reg, out_data, out_req, out_ack,
                  val_seq, halt_req, halt_a2, roll_req, roll_a2, reset);

  always wait (reset)
  begin
    halt_ack = 0;
    roll_ack = 0;
    wait (~reset);
  end

  always @(halt_a1 or halt_a2)
    if (halt_a1 & halt_a2)
      halt_ack = 1;
    else if (~halt_a1 & ~halt_a2)
      halt_ack = 0;

always @(roll_a1 or roll_a2)
  if (roll_a1 & roll_a2)
    roll_ack = 1;
  else if (~roll_a1 & ~roll_a2)
    roll_ack = 0;

endmodule // dqq

// ====================================================================

module dq_buffer (in_s, in_g, in_d, in_r, in_a,
                  out_s, out_g, out_d, out_r, out_a,
                  val_seq, halt_req, halt_ack, roll_req, roll_ack, reset);

`include "parameter"

input [SEQ_WIDTH-1:0] in_s, val_seq;
input [REG_WIDTH-1:0] in_g;
input [DATA_WIDTH:0] in_d;
output [SEQ_WIDTH-1:0] out_s;
output [REG_WIDTH-1:0] out_g;
output [DATA_WIDTH:0] out_d;
input in_r, out_a, halt_req, roll_req, reset;
output in_a, out_r, halt_ack, roll_ack;

`ifdef BEHAV_DQ

// ====================================================================
// behavioral level

reg [SEQ_WIDTH-1:0] out_s;
reg [REG_WIDTH-1:0] out_g;
reg [DATA_WIDTH:0] out_d;
reg in_a, out_r, halt_ack, roll_ack;

reg [SEQ_WIDTH-1:0] diff;
reg valid;

always wait (reset)
  begin
    disable rollback_cycle;
    disable input_cycle;
    disable output_cycle;
    out_s = XXX;
    out_g = XXX;
    out_d = XXX;
    in_a = 0;
    out_r = 0;
    halt_ack = 0;
    roll_ack = 0;
    valid = 0;
    wait (~reset);
  end

always wait (halt_req & ~reset)
  begin :rollback_cycle
    #1;
    halt_ack = 1;
    wait (roll_req);
    #1;
    disable input_cycle;
    disable output_cycle;
    in_a = 0;
    out_r = 0;

    #DLY_SEQ_COMP;
    diff = out_s - val_seq;  // compare sequence numbers
    if (~diff [SEQ_WIDTH-1])
      begin
        valid = 0;
        out_s = XXX;
        out_g = XXX;
        out_d = XXX;
      end

    roll_ack = 1;
    fork
      begin
        wait (~roll_req);
        #1;
        roll_ack = 0;
      end
      begin
        wait (~halt_req);
        #1;
        halt_ack = 0;
      end
    join
  end

always wait (in_r & ~valid & ~halt_req & ~reset)
  begin :input_cycle
    #1;
    wait (~halt_req);
    out_s = in_s;
    out_g = in_g;
    out_d = in_d;
    in_a = 1;
    valid = 1;
    wait (~in_r);
    #1;
    wait (~halt_req);
    in_a = 0;
  end

always wait (valid & ~halt_req & ~reset)
  begin :output_cycle
    #1;
    wait (~halt_req);
    out_r = 1;
    wait (out_a);
    #1;
    valid = 0;
    wait (~halt_req);
    out_r = 0;
    wait (~out_a);
  end

endmodule // dq_buffer_b (behavioral level)

`else

// ====================================================================
// gate level

muller2c x1 (in_r, ~cout2, clear1, cout1);
dlr x2 (cout1, ~halt_req, clear1, latch);
and #1 g1 (a, complete, ~roll_req);
muller2c x3 (a, ~out_a, clear2, cout2);
dlr x4 (cout2, ~halt_req, clear1, out_r);
regcmpl x5 (latch, reset, regclk, complete);
wire in_a = complete;

or #1 g2 (clear1, roll_req, reset);
and #1 g3 (b, inval, roll_req);
or #1 g4 (clear2, b, reset);

buf #18 g5 (dhalt, halt_req);  // 4 (x2) + 14 (x5)
and #1 g6 (halt_ack, dhalt, halt_req);
and #1 g7 (c, roll_req, ~cout1, ~in_a, ~out_r);
and #1 g8 (d, clear2, ~cout2);
and #1 g9 (e, ~inval, cmp_done);
or #1 g10 (f, d, e);
muller3c x6 (c, f, cmp_done, reset, roll_ack);
seqcmp x7 (out_s, val_seq, roll_req, inval, cmp_done);

`ifdef WAVE_DQ
initial
  #0 $gr_addwaves ("in_r", in_r, "cout1", cout1, "latch", latch, "in_a", in_a,
                   "a", a, "cout2", cout2, "out_r", out_r, "out_a", out_a,
                   "h_req", halt_req, "h_ack", halt_ack, "r_req", roll_req,
                   "clear1", clear1, "clear2", clear2, "r_ack", roll_ack);
`endif

dlr b40 (in_s[0], regclk, reset, out_s[0]),
    b41 (in_s[1], regclk, reset, out_s[1]),
    b42 (in_s[2], regclk, reset, out_s[2]),
    b43 (in_s[3], regclk, reset, out_s[3]);

dl b50 (in_g[0], regclk, out_g[0]),
   b51 (in_g[1], regclk, out_g[1]),
   b52 (in_g[2], regclk, out_g[2]),
   b53 (in_g[3], regclk, out_g[3]),
   b54 (in_g[4], regclk, out_g[4]);

dl b0 (in_d[0], regclk, out_d[0]),
   b1 (in_d[1], regclk, out_d[1]),
   b2 (in_d[2], regclk, out_d[2]),
   b3 (in_d[3], regclk, out_d[3]),
   b4 (in_d[4], regclk, out_d[4]),
   b5 (in_d[5], regclk, out_d[5]),
   b6 (in_d[6], regclk, out_d[6]),
   b7 (in_d[7], regclk, out_d[7]),
   b8 (in_d[8], regclk, out_d[8]),
   b9 (in_d[9], regclk, out_d[9]),
   b10 (in_d[10], regclk, out_d[10]),
   b11 (in_d[11], regclk, out_d[11]),
   b12 (in_d[12], regclk, out_d[12]),
   b13 (in_d[13], regclk, out_d[13]),
   b14 (in_d[14], regclk, out_d[14]),
   b15 (in_d[15], regclk, out_d[15]),
   b16 (in_d[16], regclk, out_d[16]),
   b17 (in_d[17], regclk, out_d[17]),
   b18 (in_d[18], regclk, out_d[18]),
   b19 (in_d[19], regclk, out_d[19]),
   b20 (in_d[20], regclk, out_d[20]),
   b21 (in_d[21], regclk, out_d[21]),
   b22 (in_d[22], regclk, out_d[22]),
   b23 (in_d[23], regclk, out_d[23]),
   b24 (in_d[24], regclk, out_d[24]),
   b25 (in_d[25], regclk, out_d[25]),
   b26 (in_d[26], regclk, out_d[26]),
   b27 (in_d[27], regclk, out_d[27]),
   b28 (in_d[28], regclk, out_d[28]),
   b29 (in_d[29], regclk, out_d[29]),
   b30 (in_d[30], regclk, out_d[30]),
   b31 (in_d[31], regclk, out_d[31]),
   b32 (in_d[32], regclk, out_d[32]);  // parity bit

endmodule // dq_buffer_g (gate level)

// ====================================================================

module seqcmp (seq, error, req, inv, ack);
input [3:0] seq, error;
input req;
output inv, ack;

brwend x1 (seq[0], error[0], req, a, na);
brw x2 (seq[1], error[1], a, na, req, b, nb);
brw x3 (seq[2], error[2], b, nb, req, c, nc);
sub x4 (seq[3], error[3], c, nc, req, d, inv, ack);

`ifdef WAVE_SEQCMP
initial
  #0 $gr_addwaves ("seq", seq, "error", error, "req", req, "ack", ack,
                   "_a", a, "na", na, "b", b, "nb", nb, "c", c, "nc", nc, "d", d,
                   "inv", inv);
`endif

endmodule // seqcmp

`endif
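
The rollback test that each data-queue buffer applies can be checked in software. The following is a minimal Python sketch, not part of the thesis's Verilog, and it assumes SEQ_WIDTH = 4 to match the 4-bit seqcmp instance above. It reproduces the behavioral rule `diff = out_s - val_seq; if (~diff[SEQ_WIDTH-1])`: a buffered entry is discarded when its sequence number is at or after the erroneous one, modulo 2^SEQ_WIDTH. In the gate-level version the same condition appears as seqcmp's inv output, which is the complement of the most significant difference bit produced by sub x4. The function names are hypothetical.

```python
# Hypothetical software model (not from the thesis) of the data-queue
# rollback test: diff = out_s - val_seq; invalidate if ~diff[SEQ_WIDTH-1].
SEQ_WIDTH = 4  # assumed, matching the 4-bit seqcmp instance


def must_discard(entry_seq: int, err_seq: int, width: int = SEQ_WIDTH) -> bool:
    """True when entry_seq is at or after err_seq in modular sequence order."""
    diff = (entry_seq - err_seq) & ((1 << width) - 1)  # width-bit subtraction
    return ((diff >> (width - 1)) & 1) == 0            # MSB clear -> discard


def rollback(queue, err_seq, width=SEQ_WIDTH):
    """Keep only entries older than the instruction that caused the error."""
    return [(seq, val) for (seq, val) in queue
            if not must_discard(seq, err_seq, width)]


if __name__ == "__main__":
    # Sequence numbers wrap from 15 to 0; the error is on instruction 14.
    q = [(13, "a"), (14, "b"), (15, "c"), (0, "d")]
    print(rollback(q, 14))  # [(13, 'a')]
```

As with any modular sequence comparison, the sign of the difference is only unambiguous while fewer than 2^(SEQ_WIDTH-1) instructions are outstanding.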

gates.v

// Gate-Level Components

module muller2 (in1, in2, out);  // 2-input C-element
input in1, in2;
output out;
supply1 vcc;
supply0 gnd;

nmos #2 m1 (out, gnd, b);
nmos #1 m2 (b, a, out);
nmos #1 m3 (a, gnd, in1);
nmos #1 m4 (a, gnd, in2);
nmos #1 m6 (b, c, in1);
nmos #1 m7 (c, gnd, in2);

pullup (b);
pullup (out);

endmodule // muller2

// ====================================================================

module muller3 (in1, in2, in3, out);  // 3-input C-element
input in1, in2, in3;
output out;
supply1 vcc;
supply0 gnd;

nmos #2 m1 (out, gnd, b);
nmos #1 m2 (b, a, out);
nmos #1 m3 (a, gnd, in1);
nmos #1 m4 (a, gnd, in2);
nmos #1 m5 (a, gnd, in3);
nmos #1 m6 (b, c, in1);
nmos #1 m7 (c, d, in2);
nmos #1 m8 (d, gnd, in3);

pullup (b);
pullup (out);

endmodule // muller3

// ====================================================================

module muller4 (in1, in2, in3, in4, out);  // 4-input C-element
input in1, in2, in3, in4;
output out;

muller2 x1 (in1, in2, out1);
muller2 x2 (in3, in4, out2);
muller2 x3 (out1, out2, out);

endmodule // muller4

// ====================================================================

module muller2c (in1, in2, clear, out);  // 2-input with clear
input in1, in2, clear;
output out;

and #1 g1 (out1, in1, ~clear);
and #1 g2 (out2, in2, ~clear);
muller2 m1 (out1, out2, out);

endmodule // muller2c

// ====================================================================

module muller3c (in1, in2, in3, clear, out);  // 3-input with clear
input in1, in2, in3, clear;
output out;

and #1 g1 (out1, in1, ~clear);
and #1 g2 (out2, in2, ~clear);
and #1 g3 (out3, in3, ~clear);
muller3 m1 (out1, out2, out3, out);

endmodule // muller3c

// ====================================================================

module dl (D, G, Q);  // D-Latch
input D, G;
output Q;
trireg a;

nmos #1 m1 (a, D, G);
pmos #1 m2 (a, Q, G);
not #1 g1 (b, a);
not #1 g2 (Q, b);

endmodule // dl

// ====================================================================

module dls (D, G, S, Q);  // D-Latch with Set
input D, G, S;
output Q;
supply1 vcc;
trireg a;

nmos #1 m1 (a, D, e);
pmos #1 m2 (a, Q, f);
nmos #1 m3 (a, vcc, S);
not #1 g1 (b, a);
not #1 g2 (Q, b);
and #1 g3 (e, G, ~S);
or #1 g4 (f, G, S);

endmodule // dls

// ====================================================================

module dlr (D, G, R, Q);  // D-Latch with Reset
input D, G, R;
output Q;
supply0 gnd;
trireg a;

nmos #1 m1 (a, D, e);
pmos #1 m2 (a, Q, f);
nmos #1 m3 (a, gnd, R);
not #1 g1 (b, a);
not #1 g2 (Q, b);
and #1 g3 (e, G, ~R);
or #1 g4 (f, G, R);

endmodule // dlr

// ====================================================================

module dffr (D, CLK, R, Q);  // D-Flip Flop with Reset
input D, CLK, R;
output Q;

dlr master (D, CLK, R, a);
dlr slave (a, ~CLK, R, Q);

endmodule // dffr

// ====================================================================
// done->complete delay: (1 for g4)+(1 for dl disable)+(1 for safety)
// reset requires 5 time units, critical path in muller3c

module regcmpl (latch, reset, regclk, complete);
input latch, reset;
output regclk, complete;

dffr x1 (nosc, latch, reset, osc);
dls x2 (osc, latch, reset, q1);
dlr x3 (nosc, latch, reset, q2);
muller3c x4 (same1, same2, latch, reset, done);
not #1 g1 (nosc, osc);
xnor #1 g2 (same1, osc, q1);
xnor #1 g3 (same2, nosc, q2);
and #1 g4 (regclk, ~done, latch);
and #3 g5 (complete, done, ~reset);

`ifdef WAVE_REGCMPL
initial
  #0 $gr_addwaves ("latch", latch, "osc", osc, "q1", q1, "q2", q2,
                   "same1", same1, "same2", same2, "done", done, "regclk", regclk,
                   "complet", complete);
`endif

endmodule // regcmpl

// ====================================================================
// A minus B, X=borrow in, Z=borrow out, R=request

module brw (A, B, X, NX, R, Z, NZ);  // full borrow circuit
input A, B, X, NX, R;
output Z, NZ;
supply1 vcc;
supply0 gnd;
trireg Y, NY;

not #1 g1 (NA, A);
not #1 g2 (NB, B);
buf #1 g3 (DR, R);
not #1 g4 (Z, NY);
not #1 g5 (NZ, Y);
pmos #1 m1 (NY, vcc, DR);
pmos #1 m2 (Y, vcc, DR);
nmos #1 m3 (c, gnd, DR);

nmos #1 m4 (NY, d, X);
nmos #1 m5 (d, c, NA);
nmos #1 m6 (d, c, B);
nmos #1 m7 (NY, e, NA);
nmos #1 m8 (e, c, B);
nmos #1 m9 (Y, f, A);
nmos #1 m10 (f, g, NB);
nmos #1 m11 (g, c, A);
nmos #1 m12 (Y, g, NX);
nmos #1 m13 (g, c, NB);

endmodule // brw

// ====================================================================
// A minus B, Z=borrow out, R=request

module brwend (A, B, R, Z, NZ);  // no borrow in
input A, B, R;
output Z, NZ;
supply1 vcc;
supply0 gnd;
trireg Y, NY;

not #1 g1 (NA, A);
not #1 g2 (NB, B);
buf #1 g3 (DR, R);
not #1 g4 (Z, NY);
not #1 g5 (NZ, Y);
pmos #1 m1 (NY, vcc, DR);
pmos #1 m2 (Y, vcc, DR);
nmos #1 m3 (c, gnd, DR);

nmos #1 m4 (NY, d, NA);
nmos #1 m5 (d, c, B);
nmos #1 m6 (Y, c, A);
nmos #1 m7 (Y, c, NB);

endmodule // brwend

// ====================================================================
// use A/NA for slower signal

module stxor (A, NA, B, NB, R, Z, NZ);  // self-timed XOR
input A, NA, B, NB, R;
output Z, NZ;
supply1 vcc;
supply0 gnd;
trireg Y, NY;

not #1 g1 (Z, NY);
not #1 g2 (NZ, Y);
pmos #1 m1 (NY, vcc, R);
pmos #1 m2 (Y, vcc, R);
nmos #1 m3 (c, gnd, R);

nmos #1 m4 (NY, d, NA);
nmos #1 m5 (d, c, B);
nmos #1 m6 (NY, e, A);
nmos #1 m7 (e, c, NB);
nmos #1 m8 (Y, f, A);
nmos #1 m9 (f, c, B);
nmos #1 m10 (Y, f, NB);
nmos #1 m11 (f, c, NA);

endmodule // stxor

// ====================================================================
// A minus B, X=borrow in, D=difference, R=request

module sub (A, B, X, NX, REQ, D, ND, ACK);  // subtraction bitslice
input A, B, X, NX, REQ;
output D, ND, ACK;
supply1 vcc;
supply0 gnd;
trireg D, ND;

not #1 g1 (NA, A);
not #1 g2 (NB, B);
buf #1 g3 (DR, REQ);
or #1 g4 (ACK, D, ND);

stxor x1 (A, NA, B, NB, DR, c, nc);
stxor x2 (c, nc, X, NX, DR, D, ND);

endmodule // sub
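
In muller2, node b is pulled low when in1 and in2 are both high, or when out is high and either input is high, and out is the inverse of b. The network therefore realizes the Muller C-element equation out' = in1·in2 + out·(in1 + in2); muller3 extends the same structure to three inputs. The Python sketch below is a software model, not part of the listing, that evaluates this next-state equation for any number of inputs and adds the input gating used by muller2c and muller3c, where every input is ANDed with ~clear. It ignores gate delays and the function names are hypothetical.

```python
# Behavioral model (assumed, not from the thesis) of the C-elements in gates.v.
from typing import Sequence


def c_element(inputs: Sequence[int], out: int) -> int:
    """Next output: 1 if all inputs are 1, 0 if all are 0, else hold `out`.

    Equivalent to out' = AND(inputs) + out * OR(inputs).
    """
    return int(all(inputs) or (bool(out) and any(inputs)))


def c_element_clear(inputs: Sequence[int], clear: int, out: int) -> int:
    """muller2c/muller3c: every input is ANDed with ~clear first."""
    gated = [int(bool(x) and not clear) for x in inputs]
    return c_element(gated, out)


if __name__ == "__main__":
    out = 0
    # Prints outputs 0, 1, 1, 0 in turn: the output changes only when
    # both inputs agree, and holds its value otherwise.
    for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
        out = c_element((a, b), out)
        print(a, b, "->", out)
```

Holding clear high drives the gated inputs to 0, so the output falls to 0 and stays there, which is how the handshake stages above are reset or cancelled.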

iiu.v

// Instruction Issuing Unit

module iiu (inst_bus, inst_req, inst_ack, inst_en,
            chkbits, log_req, log_ack, cki_req, cki_ack,
            newpc_req, pc_ack, imem_ack, iq_ack,
            seq_num, reg_rd, reg_rs, reg_rt, res_req, res_ack,
            sim_f, a_bus, b_bus, rfile_req, rfile_ack, ckf_req, ckf_ack,
            alu_func, alu_req, alu_ack,
            mem_rw, mem_req, mem_ack,
            val_seq, halt_req, halt_ack, roll_req, roll_ack,
            retry, stop, reset);

`include "parameter"

inout [DATA_WIDTH:0] inst_bus, a_bus, b_bus;
output [CHECKERS-1:0] chkbits;
output [SEQ_WIDTH-1:0] seq_num;
output [REG_WIDTH-1:0] reg_rd, reg_rs, reg_rt;
output [FUNC_WIDTH-1:0] alu_func;
input [SEQ_WIDTH-1:0] val_seq;
input halt_req, roll_req;
output inst_en, sim_f, mem_rw, retry, stop;
input inst_req, log_ack, cki_ack, pc_ack, imem_ack, iq_ack, res_ack,
      rfile_ack, ckf_ack, alu_ack, mem_ack, reset;
output inst_ack, log_req, cki_req, newpc_req, res_req, rfile_req, ckf_req,
       alu_req, mem_req, halt_ack, roll_ack;

reg [CHECKERS-1:0] chkbits;
reg [SEQ_WIDTH-1:0] seq_num;
reg [REG_WIDTH-1:0] reg_rd, reg_rs, reg_rt;
reg [FUNC_WIDTH-1:0] alu_func;
reg inst_en, sim_f, mem_rw, retry, stop;
reg inst_ack, log_req, cki_req, newpc_req, res_req, rfile_req, ckf_req,
    alu_req, mem_req, halt_ack, roll_ack;

reg [DATA_WIDTH:0] inst_buf, abus_in, abus_out, bbus_in, bbus_out;
reg [ADDR_WIDTH:0] int_pc;
reg [5:0] loop;  // data/addr 32 bits max
reg go_decode, parity;

wire [OP_WIDTH-1:0] opcode = inst_buf [DATA_WIDTH-1:OFFSET_WIDTH];
wire [REG_WIDTH-1:0] rs = inst_buf [OFFSET_WIDTH-1:OFFSET_WIDTH-REG_WIDTH];
wire [REG_WIDTH-1:0] rt = inst_buf [OFFSET_WIDTH-REG_WIDTH-1:IMM_WIDTH];
wire [REG_WIDTH-1:0] rd = inst_buf [IMM_WIDTH-1:IMM_WIDTH-REG_WIDTH];
wire [OFFSET_WIDTH-1:0] offset = inst_buf [OFFSET_WIDTH-1:0];
wire [IMM_WIDTH-1:0] imm = inst_buf [IMM_WIDTH-1:0];
wire [EXTRA_WIDTH-1:0] extra = inst_buf [EXTRA_WIDTH-1:0];


assign inst_bus = inst_en ? ZZZ : int_pc;
assign a_bus = (rfile_req | halt_ack) ? ZZZ : abus_out;
assign b_bus = rfile_req ? ZZZ : bbus_out;


always wait (reset)
  begin
    disable rollback_cycle;
    disable fetch_cycle;
    disable decode_cycle;
    chkbits = XXX;
    seq_num = 0;          // initial sequence number
    reg_rs = XXX;
    reg_rt = XXX;
    reg_rd = XXX;
    alu_func = XXX;
    inst_en = 1;          // enable instructions
    sim_f = 0;
    mem_rw = XXX;
    retry = 0;
    stop = 0;
    inst_ack = 0;         // handshaking signals
    log_req = 0;
    cki_req = 0;
    newpc_req = 0;
    res_req = 0;
    rfile_req = 0;
    ckf_req = 0;
    alu_req = 0;
    mem_req = 0;
    halt_ack = 0;
    roll_ack = 0;

    int_pc = 0;           // initial internal PC
    abus_out = XXX;
    bbus_out = XXX;
    go_decode = 0;
    wait (~reset);
  end


always wait (halt_req & ~reset)
  begin :rollback_cycle
    #1;
    wait (~newpc_req & ~pc_ack & ~imem_ack & ~iq_ack);
    halt_ack = 1;
    wait (roll_req);
    #1;
    disable fetch_cycle;
    disable decode_cycle;
    chkbits = XXX;
    reg_rs = XXX;
    reg_rt = XXX;
    reg_rd = XXX;
    alu_func = XXX;
    inst_en = 0;
    sim_f = 0;
    mem_rw = XXX;
    inst_ack = 0;
    log_req = 0;
    cki_req = 0;
    newpc_req = 0;
    res_req = 0;
    rfile_req = 0;
    ckf_req = 0;
    alu_req = 0;
    mem_req = 0;

    abus_out = XXX;
    bbus_out = XXX;
    go_decode = 0;
    retry = 1;            // rollback flag

    roll_ack = 1;
    wait (~roll_req);
    #1;
    seq_num = val_seq;    // sequence number with error
    int_pc = a_bus;       // address with error
    fork
      begin
        roll_ack = 0;
        wait (~halt_req);
        #1;
        halt_ack = 0;
      end
      pc_handler;
    join
  end


always wait (inst_req & ~go_decode & inst_en & ~halt_req & ~reset)
  begin :fetch_cycle
    #1;
    inst_buf = inst_bus;
    bbus_out = inst_bus;  // for CKI
    abus_out = int_pc;    // for LOG
    #DLY_IIU_PCINC;
    int_pc = int_pc + ADDR_INC;
    parity = 0;
    for (loop=0; loop<ADDR_WIDTH; loop=loop+1)
      parity = parity ^ int_pc [loop];
    int_pc [ADDR_WIDTH] = parity;
    inst_ack = 1;
    go_decode = 1;
    wait (~inst_req);
    #1;
    inst_ack = 0;
  end


always wait (go_decode & ~halt_req & ~reset)
  begin :decode_cycle
    #DLY_IIU_DECOD;
    case (opcode)         // determine check bits
      OP_SPECIAL:
        if (extra == SOP_NOP)
          chkbits = 4'b 0001;
        else
          chkbits = 4'b 1011;
      OP_J, OP_TRAP:
        chkbits = 4'b 0001;
      OP_BEQZ, OP_BNEZ, OP_JR, OP_JRF:
        chkbits = 4'b 0011;
      OP_SW, OP_SWF:
        chkbits = 4'b 0111;
      OP_LW:
        chkbits = 4'b 1111;
      OP_JAL:
        chkbits = 4'b 1001;
      default:            // JALR, ALU immediate
        chkbits = 4'b 1011;
    endcase
    #1;

    wait (~halt_req);
    log_req = 1;          // log instruction
    wait (log_ack);
    #1;
    fork
      begin
        wait (~halt_req);
        log_req = 0;
        wait (~log_ack);
      end
      begin
        wait (~halt_req);
        cki_req = 1;      // check instruction
        wait (cki_ack);
        #1;
        wait (~halt_req);
        cki_req = 0;
        wait (~cki_ack);
      end
    join

    case (opcode)
      OP_SPECIAL: special_handler;
      OP_BEQZ, OP_BNEZ: branch_handler;
      OP_J, OP_JAL, OP_JR, OP_JALR, OP_JRF: jump_handler;
      OP_LW, OP_SW, OP_SWF: memory_handler;
      OP_TRAP: trap_handler;

      OP_ADDUI: begin alu_func=FUNC_ADDU; alu_control; end
      OP_SUBUI: begin alu_func=FUNC_SUBU; alu_control; end
      OP_ANDI: begin alu_func=FUNC_AND; alu_control; end
      OP_ORI: begin alu_func=FUNC_OR; alu_control; end
      OP_XORI: begin alu_func=FUNC_XOR; alu_control; end
      OP_LHI: begin alu_func=FUNC_LHI; alu_control; end
      OP_SLLI: begin alu_func=FUNC_SLL; alu_control; end
      OP_SRLI: begin alu_func=FUNC_SRL; alu_control; end
      OP_SRAI: begin alu_func=FUNC_SRA; alu_control; end
      OP_SEQI: begin alu_func=FUNC_SEQ; alu_control; end
      OP_SNEI: begin alu_func=FUNC_SNE; alu_control; end
      OP_SLTI: begin alu_func=FUNC_SLT; alu_control; end
      OP_SGTI: begin alu_func=FUNC_SGT; alu_control; end
      OP_SLEI: begin alu_func=FUNC_SLE; alu_control; end
      OP_SGEI: begin alu_func=FUNC_SGE; alu_control; end
      OP_ADDUIF: begin alu_func=FUNC_ADDUF; alu_control; end

      default:
        begin
          $display ("Invalid Opcode %h", inst_buf);
          stop = 1;
          wait (reset);
        end
    endcase
    retry = 0;            // reset rollback flag
    go_decode = 0;
    seq_num = seq_num + 1;
  end


task special_handler;
  begin
    case (extra)
      SOP_NOP: ;
      SOP_ADDU: begin alu_func=FUNC_ADDU; alu_control; end
      SOP_SUBU: begin alu_func=FUNC_SUBU; alu_control; end
      SOP_AND: begin alu_func=FUNC_AND; alu_control; end
      SOP_OR: begin alu_func=FUNC_OR; alu_control; end
      SOP_XOR: begin alu_func=FUNC_XOR; alu_control; end
      SOP_SLL: begin alu_func=FUNC_SLL; alu_control; end
      SOP_SRL: begin alu_func=FUNC_SRL; alu_control; end
      SOP_SRA: begin alu_func=FUNC_SRA; alu_control; end
      SOP_SEQ: begin alu_func=FUNC_SEQ; alu_control; end
      SOP_SNE: begin alu_func=FUNC_SNE; alu_control; end
      SOP_SLT: begin alu_func=FUNC_SLT; alu_control; end
      SOP_SGT: begin alu_func=FUNC_SGT; alu_control; end
      SOP_SLE: begin alu_func=FUNC_SLE; alu_control; end
      SOP_SGE: begin alu_func=FUNC_SGE; alu_control; end
      SOP_ADDUF: begin alu_func=FUNC_ADDUF; alu_control; end

      default:
        begin
          $display ("Invalid Special Opcode %h", inst_buf);
          stop = 1;
          wait (reset);
        end
    endcase
  end
endtask


task alu_control;
  begin
    if ((alu_func == FUNC_ADDUF) & retry)  // no error if rollback
      alu_func = FUNC_ADDU;
    reg_rs = rs;
    reg_rt = (opcode == OP_SPECIAL) ? rt : 0;
    reg_rd = (opcode == OP_SPECIAL) ? rd : rt;
    reserve_handler;
    regfile_handler;
    abus_out = abus_in;
    bbus_out = (opcode == OP_SPECIAL) ? bbus_in :
               {{(DATA_WIDTH-IMM_WIDTH){imm[IMM_WIDTH-1]}}, imm};
    alu_handler;
  end
endtask


task memory_handler;
  begin
    reg_rs = rs;
    reg_rt = (opcode == OP_LW) ? 0 : rt;
    reg_rd = (opcode == OP_LW) ? rt : 0;
    reserve_handler;
    regfile_handler;
    mem_rw = (opcode == OP_LW);
    #DLY_IIU_ADD;
    abus_out = abus_in +
               {{(DATA_WIDTH-IMM_WIDTH){imm[IMM_WIDTH-1]}}, imm};
    parity = 0;
    for (loop=0; loop<ADDR_WIDTH; loop=loop+1)
      parity = parity ^ abus_out [loop];
    abus_out [ADDR_WIDTH] = parity;
    bbus_out = bbus_in;
    if ((opcode == OP_SWF) & ~retry)      // simulate fault
      bbus_out [DATA_WIDTH] = ~bbus_out [DATA_WIDTH];
    #1;
    wait (~halt_req);
    mem_req = 1;
    wait (mem_ack);
    #1;
    wait (~halt_req);
    mem_req = 0;
    wait (~mem_ack);
  end
endtask


task branch_handler;
  begin
    reg_rs = rs;
    reg_rt = 0;
    reg_rd = 0;
    reserve_handler;
    regfile_handler;
    if (((opcode==OP_BEQZ) && (abus_in==0)) ||
        ((opcode==OP_BNEZ) && (abus_in!=0)))
      begin
        #DLY_IIU_ADD;
        int_pc = int_pc[ADDR_WIDTH-1:0] + {{(DATA_WIDTH-IMM_WIDTH)
                 {imm[IMM_WIDTH-1]}}, imm};  // sign ext.
        pc_handler;
      end
  end
endtask


task jump_handler;
  begin
    bbus_out = int_pc[ADDR_WIDTH-1:0];  // for JAL and JALR

    if ((opcode==OP_J) || (opcode==OP_JAL))
      begin
        #DLY_IIU_ADD;
        int_pc = int_pc[ADDR_WIDTH-1:0] + {{(DATA_WIDTH-OFFSET_WIDTH)
                 {offset[OFFSET_WIDTH-1]}}, offset};
        pc_handler;
      end
    else
      begin
        sim_f = (opcode == OP_JRF) & ~retry;  // simulate fault
        reg_rs = rs;
        reg_rt = 0;
        reg_rd = 0;
        reserve_handler;
        regfile_handler;
        int_pc = abus_in;
        pc_handler;
      end

    if ((opcode==OP_JAL) || (opcode==OP_JALR))
      begin
        alu_func = FUNC_PASS;
        reg_rs = 0;
        reg_rt = 0;
        reg_rd = REG_SIZE - 1;  // save to highest register
        reserve_handler;
        alu_handler;
      end
  end
endtask


task trap_handler;
  begin
    case (offset)
      TRAP_BLANK: $display();

      TRAP_STOP:
        begin
          $display ("Trap Stop %h", inst_buf);
          stop = 1;
          wait (reset);
        end

      default:
        if (offset < REG_SIZE)
          begin
            reg_rs = offset;
            reg_rt = 0;
            reg_rd = 0;
            reserve_handler;
            regfile_handler;
            $display ("Reg %d: Dec=%d Hex=%h Bin=%b",
                      reg_rs, abus_in [DATA_WIDTH-1:0],
                      abus_in, abus_in);
          end
        else
          begin
            $display ("Invalid Trap Service %h", inst_buf);
            stop = 1;
            wait (reset);
          end
    endcase
  end
endtask


task reserve_handler;
  begin
    #1;
    wait (~halt_req);
    res_req = 1;
    wait (res_ack);
    #1;
    wait (~halt_req);
    res_req = 0;
    wait (~res_ack);
    #1;
  end
endtask


task regfile_handler;
  begin
    rfile_req = 1;
    wait (rfile_ack);
    #1;
    wait (~halt_req);
    abus_in = a_bus;
    bbus_in = b_bus;
    ckf_req = 1;          // notify checker
    wait (ckf_ack);
    #1;
    rfile_req = 0;
    wait (~halt_req);
    ckf_req = 0;
    wait (~ckf_ack & ~rfile_ack);
  end
endtask


task alu_handler;
  begin
    parity = 0;
    for (loop=0; loop<DATA_WIDTH; loop=loop+1)
      parity = parity ^ abus_out [loop];
    abus_out [DATA_WIDTH] = parity;
    parity = 0;
    for (loop=0; loop<DATA_WIDTH; loop=loop+1)
      parity = parity ^ bbus_out [loop];
    bbus_out [DATA_WIDTH] = parity;
    #1;
    wait (~halt_req);
    alu_req = 1;
    wait (alu_ack);
    #1;
    wait (~halt_req);
    alu_req = 0;
    wait (~alu_ack);
  end
endtask


task pc_handler;  // not interrupted by rollback
  begin
    parity = 0;
    for (loop=0; loop<ADDR_WIDTH; loop=loop+1)  // calculate parity
      parity = parity ^ int_pc [loop];
    int_pc [ADDR_WIDTH] = parity;

    wait (~inst_ack);     // finish inst fetch cycle
    inst_en = 0;
    #1;
    newpc_req = 1;
    wait (pc_ack & imem_ack & iq_ack);
    #1;
    inst_en = 1;
    newpc_req = 0;
    wait (~pc_ack & ~imem_ack & ~iq_ack);
  end
endtask

endmodule // iiu

imem.v

// Instruction Memory

module imem (addr, retry, in_req, in_ack, data, out_req, out_ack,
             cancel_req, cancel_ack, reset);

`include "parameter"

input [ADDR_WIDTH:0] addr;
output [DATA_WIDTH:0] data;
input retry, in_req, out_ack, cancel_req, reset;
output in_ack, out_req, cancel_ack;

reg [DATA_WIDTH:0] data, imemory [0:IMEM_SIZE-1];
reg in_ack, out_req, cancel_ack;

reg [5:0] loop;  // DATA_WIDTH = 32 bits max
reg parity;

initial
  $readmemh ("inst.hex", imemory, 0);

always wait (reset)
  begin
    disable memory_cycle;
    disable cancel_cycle;
    data = XXX;
    in_ack = 0;
    out_req = 0;
    cancel_ack = 0;
    wait (~reset);
  end

always wait (cancel_req & ~reset)
  begin :cancel_cycle
    #1;
    disable memory_cycle;
    data = XXX;
    in_ack = 0;
    out_req = 0;
    cancel_ack = 1;
    wait (~cancel_req);
    #1;
    cancel_ack = 0;
  end

always wait (in_req & ~cancel_req & ~reset)
  begin :memory_cycle
    #DLY_IMEM_RD;
    data = imemory [addr[ADDR_WIDTH-1:ADDR_IGNORE]];

    if (retry)            // correct parity error
      begin
        parity = 0;
        for (loop=0; loop<DATA_WIDTH; loop=loop+1)
          parity = parity ^ data [loop];
        data [DATA_WIDTH] = parity;
        imemory [addr[ADDR_WIDTH-1:ADDR_IGNORE]] = data;
      end

    fork
      begin
        in_ack = 1;
        wait (~in_req);
        #1;
        in_ack = 0;
      end
      begin
        #1;
        out_req = 1;
        wait (out_ack);
        #1;
        out_req = 0;
        wait (~out_ack);
      end
    join
  end

endmodule // imem
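
Both iiu.v and imem.v keep parity in an extra most significant bit. Their loops XOR bits 0 through DATA_WIDTH-1 (ADDR_WIDTH-1 for addresses) and store the result in bit DATA_WIDTH (ADDR_WIDTH), so every consistent word has an even number of 1 bits. When retry is set, imem recomputes that bit and writes the repaired word back into imemory. The Python sketch below models those two steps only. It assumes DATA_WIDTH = 32, inferred from the 33 latches b0 through b32 in the gate-level buffers, and the helper names are hypothetical.

```python
# Hypothetical model of the parity handling in iiu.v and imem.v.
DATA_WIDTH = 32  # assumed from the 33-bit (32 data + parity) buses


def with_parity(word: int, width: int = DATA_WIDTH) -> int:
    """Set bit `width` to the XOR of bits 0..width-1, as the Verilog loops do."""
    data = word & ((1 << width) - 1)
    parity = bin(data).count("1") & 1
    return data | (parity << width)


def parity_ok(word: int, width: int = DATA_WIDTH) -> bool:
    """A word is consistent when bits 0..width hold an even number of 1s."""
    return bin(word & ((1 << (width + 1)) - 1)).count("1") % 2 == 0


def imem_read(memory: list, index: int, retry: bool) -> int:
    """Mimic imem's memory_cycle: on retry, repair parity and write back."""
    data = memory[index]
    if retry:
        data = with_parity(data)
        memory[index] = data
    return data


if __name__ == "__main__":
    good = with_parity(0x1234)
    bad = good ^ (1 << DATA_WIDTH)  # invert the parity bit (bit DATA_WIDTH)
    mem = [bad]
    print(parity_ok(imem_read(mem, 0, retry=False)))  # False
    print(parity_ok(imem_read(mem, 0, retry=True)))   # True
```

Inverting bit DATA_WIDTH is the same corruption that memory_handler applies to store data for OP_SWF, so the sketch also shows why a single flipped bit is enough for a parity checker to flag the word.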

iq.v

// Instruction Queue

module iq (in_data, in_req, in_ack, tri_data, out_req, out_ack, en_out,
           cancel_req, cancel_ack, reset);

`include "parameter"

input [DATA_WIDTH:0] in_data;
output [DATA_WIDTH:0] tri_data;
input in_req, out_ack, en_out, cancel_req, reset;
output in_ack, out_req, cancel_ack;

`ifdef BEHAV_IQ

// ====================================================================
// behavioral level

wire [DATA_WIDTH:0] data1, out_data;

iq_buffer_b buf1 (in_data, in_req, in_ack, data1, req1, ack1, cancel_req,
                  cancel_a1, reset);
iq_buffer_b buf2 (data1, req1, ack1, out_data, out_req, out_ack, cancel_req,
                  cancel_a2, reset);

assign tri_data = en_out ? out_data : ZZZ;

reg cancel_ack;

always wait (reset)
  begin
    cancel_ack = 0;
    wait (~reset);
  end

always @(cancel_a1 or cancel_a2)
  if (cancel_a1 & cancel_a2)
    cancel_ack = 1;
  else if (~cancel_a1 & ~cancel_a2)
    cancel_ack = 0;

endmodule // iq (behavioral level)

// ====================================================================
// behavioral level

module iq_buffer_b (in_d, in_r, in_a, out_d, out_r, out_a, cancel_r,
                    cancel_a, reset);

`include "parameter"

input [DATA_WIDTH:0] in_d;
output [DATA_WIDTH:0] out_d;
input in_r, out_a, cancel_r, reset;
output in_a, out_r, cancel_a;

reg [DATA_WIDTH:0] out_d;
reg valid, in_a, out_r, cancel_a;

always wait (reset)
  begin
    disable input_cycle;
    disable output_cycle;
    disable cancel_cycle;
    out_d = XXX;
    valid = 0;
    in_a = 0;
    out_r = 0;
    cancel_a = 0;
    wait (~reset);
  end

always wait (cancel_r & ~reset)
  begin :cancel_cycle
    #1;
    disable input_cycle;
    disable output_cycle;
    out_d = XXX;
    valid = 0;
    in_a = 0;
    out_r = 0;
    cancel_a = 1;
    wait (~cancel_r);
    #1;
    cancel_a = 0;
  end

always wait (in_r & ~valid & ~cancel_r & ~reset)
  begin :input_cycle
    #1;
    out_d = in_d;
    in_a = 1;
    valid = 1;
    wait (~in_r);
    #1;
    in_a = 0;
  end

always wait (valid & ~cancel_r & ~reset)
  begin :output_cycle
    #1;
    out_r = 1;
    wait (out_a);
    #1;
    valid = 0;
    out_r = 0;
    wait (~out_a);
  end

endmodule // iq_buffer_b (behavioral level)

`else

// ====================================================================
// gate level

wire [DATA_WIDTH:0] data1, out_data;

iq_buffer_g buf1 (in_data, in_req, in_ack, data1, req1, ack1, cancel_req,
                  cancel_a1, reset);
iq_buffer_g buf2 (data1, req1, ack1, out_data, out_req, out_ack, cancel_req,
                  cancel_a2, reset);

muller2 m1 (cancel_a1, cancel_a2, cancel_ack);

bufif1 #0
  (tri_data[0], out_data[0], en_out),
  (tri_data[1], out_data[1], en_out),
  (tri_data[2], out_data[2], en_out),
  (tri_data[3], out_data[3], en_out),
  (tri_data[4], out_data[4], en_out),
  (tri_data[5], out_data[5], en_out),
  (tri_data[6], out_data[6], en_out),
  (tri_data[7], out_data[7], en_out),
  (tri_data[8], out_data[8], en_out),
  (tri_data[9], out_data[9], en_out),
  (tri_data[10], out_data[10], en_out),
  (tri_data[11], out_data[11], en_out),
  (tri_data[12], out_data[12], en_out),
  (tri_data[13], out_data[13], en_out),
  (tri_data[14], out_data[14], en_out),
  (tri_data[15], out_data[15], en_out),
  (tri_data[16], out_data[16], en_out),
  (tri_data[17], out_data[17], en_out),
  (tri_data[18], out_data[18], en_out),
  (tri_data[19], out_data[19], en_out),
  (tri_data[20], out_data[20], en_out),
  (tri_data[21], out_data[21], en_out),
  (tri_data[22], out_data[22], en_out),
  (tri_data[23], out_data[23], en_out),
  (tri_data[24], out_data[24], en_out),
  (tri_data[25], out_data[25], en_out),
  (tri_data[26], out_data[26], en_out),
  (tri_data[27], out_data[27], en_out),
  (tri_data[28], out_data[28], en_out),
  (tri_data[29], out_data[29], en_out),
  (tri_data[30], out_data[30], en_out),
  (tri_data[31], out_data[31], en_out),
  (tri_data[32], out_data[32], en_out);  // parity bit

endmodule // iq (gate level)

// ====================================================================
// gate level

module iq_buffer_g (in_d, in_r, in_a, out_d, out_r, out_a, cancel_r,
                    cancel_a, reset);

`include "parameter"

input [DATA_WIDTH:0] in_d;
output [DATA_WIDTH:0] out_d;
input in_r, out_a, cancel_r, reset;
output in_a, out_r, cancel_a;

muller2c x1 (in_r, ~out_r, clear, latch);
muller2c x2 (complete, ~out_a, clear, out_r);
regcmpl x3 (latch, clear, regclk, complete);
or #1 g1 (clear, cancel_r, reset);
buf #7 g2 (cancel_a, cancel_r);
wire in_a = complete;

`ifdef WAVE_IQ
initial
  #0 $gr_addwaves ("in_d", in_d, "out_d", out_d, "in_r", in_r, "in_a", in_a,
                   "out_r", out_r, "out_a", out_a, "can_r", cancel_r,
                   "can_a", cancel_a);
`endif

dl b0 (in_d[0], regclk, out_d[0]),
   b1 (in_d[1], regclk, out_d[1]),
   b2 (in_d[2], regclk, out_d[2]),
   b3 (in_d[3], regclk, out_d[3]),
   b4 (in_d[4], regclk, out_d[4]),
   b5 (in_d[5], regclk, out_d[5]),
   b6 (in_d[6], regclk, out_d[6]),
   b7 (in_d[7], regclk, out_d[7]),
   b8 (in_d[8], regclk, out_d[8]),
   b9 (in_d[9], regclk, out_d[9]),
   b10 (in_d[10], regclk, out_d[10]),
   b11 (in_d[11], regclk, out_d[11]),
   b12 (in_d[12], regclk, out_d[12]),
   b13 (in_d[13], regclk, out_d[13]),
   b14 (in_d[14], regclk, out_d[14]),
   b15 (in_d[15], regclk, out_d[15]),
   b16 (in_d[16], regclk, out_d[16]),
   b17 (in_d[17], regclk, out_d[17]),
   b18 (in_d[18], regclk, out_d[18]),
   b19 (in_d[19], regclk, out_d[19]),
   b20 (in_d[20], regclk, out_d[20]),
   b21 (in_d[21], regclk, out_d[21]),
   b22 (in_d[22], regclk, out_d[22]),
   b23 (in_d[23], regclk, out_d[23]),
   b24 (in_d[24], regclk, out_d[24]),

166

iq.v

214: b25 (in_d[25], regclk, out_d[25]),215: b26 (in_d[26], regclk, out_d[26]),216: b27 (in_d[27], regclk, out_d[27]),217: b28 (in_d[28], regclk, out_d[28]),218: b29 (in_d[29], regclk, out_d[29]),219: b30 (in_d[30], regclk, out_d[30]),220: b31 (in_d[31], regclk, out_d[31]),221: b32 (in_d[32], regclk, out_d[32]); // parity bit222:223: endmodule // iq_buffer_g (gate level)224:225: ‘endif


log.v

// Instruction Log

module log (log_seq, log_chkbits, a_bus, log_req, log_ack,
            chk_seq, chk_chkid, chk_error, chk_req, chk_ack,
            val_seq, valmem_req, valmem_ack, valreg_req, valreg_ack,
            halt_req, halt_ack, roll_req, roll_ack, reset);

`include "parameter"

input [SEQ_WIDTH-1:0] log_seq, chk_seq;
input [CHECKERS-1:0] log_chkbits;
inout [DATA_WIDTH:0] a_bus;
input [CHKID_WIDTH-1:0] chk_chkid;
output [SEQ_WIDTH-1:0] val_seq;
input chk_error, reset;
input log_req, chk_req, valmem_ack, valreg_ack, halt_ack, roll_ack;
output log_ack, chk_ack, valmem_req, valreg_req, halt_req, roll_req;

reg [SEQ_WIDTH-1:0] val_seq;
reg chk_ack, valmem_req, valreg_req, halt_req, roll_req;

reg [SEQ_WIDTH-1:0] clr_seq;
reg [CHKID_WIDTH-1:0] clr_id;
reg [DWBS-1:0] buf_dwb;
reg del_ack, clr_req, arbdel_req, arbdel_ack, arbchk_req, arbchk_ack;

wire [DWBS-1:0] log_dwb = log_chkbits [CHECKERS-1:CHECKERS-DWBS];
wire [SEQ_WIDTH-1:0] del_seq;
wire [CHECKERS-1:0] del_chkbits;
wire [DWBS-1:0] del_dwb;
wire del_req, clr_ack;

logq logq (log_seq, log_chkbits, log_dwb, a_bus, log_req, log_ack,
           del_seq, del_chkbits, del_dwb, del_req, del_ack,
           clr_seq, clr_id, clr_req, clr_ack,
           halt_req, haltq_ack, roll_req, rollq_ack, reset);

always wait (reset)
  begin
    disable delete_cycle;
    disable checker_cycle;
    disable arb_cycle;
    val_seq = XXX;
    chk_ack = 0;
    valmem_req = 0;
    valreg_req = 0;
    halt_req = 0;
    roll_req = 0;
    del_ack = 0;
    clr_req = 0;
    arbdel_req = 0;
    arbdel_ack = 0;
    arbchk_req = 0;
    arbchk_ack = 0;
    wait (~reset);
  end

always wait (chk_req & ~reset)
  begin :checker_cycle
    #1;
    clr_seq = chk_seq;
    clr_id = chk_chkid;
    if (chk_error)
      begin // rollback process
        arbchk_req = 1;
        wait (arbchk_ack);
        #1;
        val_seq = chk_seq; // sequence number with error
        halt_req = 1;
        wait (halt_ack & haltq_ack);
        #1;
        roll_req = 1;
        disable delete_cycle; // take care of local business
        arbdel_req = 0;
        wait (roll_ack & rollq_ack);
        #1;
        roll_req = 0;
        wait (~roll_ack & ~rollq_ack);
        #1;
        halt_req = 0;
        arbchk_req = 0;
        wait (~halt_ack & ~haltq_ack & ~arbchk_ack);
      end
    else
      fork
        begin // clear the check bit
          #1;
          clr_req = 1;
          wait (clr_ack);
          #1;
          clr_req = 0;
          wait (~clr_ack);
        end
        begin
          chk_ack = 1; // finish K-bus transaction
          wait (~chk_req);
          #1;
          chk_ack = 0;
        end
      join
  end

always wait (del_req & ~halt_req & ~reset)
  begin :delete_cycle
    wait (del_chkbits == 0); // until all bits are cleared
    #1;
    arbdel_req = 1;
    wait (arbdel_ack);
    #1;
    val_seq = del_seq;
    buf_dwb = del_dwb;
    del_ack = 1; // delete from log
    fork
      begin
        wait (~del_req);
        #1;
        del_ack = 0;
      end
      case (buf_dwb) // validate appropriate DWB
        2'b01:
          begin
            #1;
            valmem_req = 1;
            wait (valmem_ack);
            #1;
            valmem_req = 0;
            arbdel_req = 0;
            wait (~valmem_ack & ~arbdel_ack);
          end
        2'b10, 2'b11:
          begin
            #1;
            valreg_req = 1;
            wait (valreg_ack);
            #1;
            valreg_req = 0;
            arbdel_req = 0;
            wait (~valreg_ack & ~arbdel_ack);
          end
        default:
          begin
            arbdel_req = 0;
            wait (~arbdel_ack);
          end
      endcase
    join
  end

always wait ((arbchk_req | arbdel_req) & ~reset)
  begin :arb_cycle
    #1;
    if (arbchk_req) // invalidate has priority over validate
      begin
        arbchk_ack = 1;
        wait (~arbchk_req);
        #1;
        arbchk_ack = 0;
      end
    else
      begin
        arbdel_ack = 1;
        wait (~arbdel_req);
        #1;
        arbdel_ack = 0;
      end
  end

endmodule // log

// ====================================================================
// Log Queue: Note that address is not used in the output.

module logq (in_seq, in_chkbits, in_dwb, a_bus, in_req, in_ack,
             out_seq, out_chkbits, out_dwb, out_req, out_ack,
             clr_seq, clr_id, clr_req, clr_ack,
             halt_req, halt_ack, roll_req, roll_ack, reset);

`include "parameter"

input [SEQ_WIDTH-1:0] in_seq, clr_seq;
input [CHECKERS-1:0] in_chkbits;
input [DWBS-1:0] in_dwb;
inout [DATA_WIDTH:0] a_bus;
output [SEQ_WIDTH-1:0] out_seq;
output [CHECKERS-1:0] out_chkbits;
output [DWBS-1:0] out_dwb;
input [CHKID_WIDTH-1:0] clr_id;
input in_req, out_ack, clr_req, halt_req, roll_req, reset;
output in_ack, out_req, clr_ack, halt_ack, roll_ack;

reg clr_ack, halt_ack, roll_ack;

wire [SEQ_WIDTH-1:0] seq1, seq2, seq3, seq4, seq5, seq6, seq7;
wire [CHECKERS-1:0] chkbits1, chkbits2, chkbits3, chkbits4, chkbits5,
                    chkbits6, chkbits7;
wire [DWBS-1:0] dwb1, dwb2, dwb3, dwb4, dwb5, dwb6, dwb7;
wire [ADDR_WIDTH:0] addr1, addr2, addr3, addr4, addr5, addr6, addr7,
                    not_used, tri_addr;
wire [ADDR_WIDTH:0] adpt_addr = a_bus; // "addr" adaptor

assign a_bus = {ZZZ, tri_addr}; // compensate for narrow bus


log_buf buf1 (in_seq, in_chkbits, in_dwb, adpt_addr, in_req, in_ack,
              seq1, chkbits1, dwb1, addr1, req1, ack1,
              clr_seq, clr_id, clr_req, clr_a1,
              halt_req, halt_a1, roll_req, roll_a1, tri_addr, reset);

log_buf buf2 (seq1, chkbits1, dwb1, addr1, req1, ack1,
              seq2, chkbits2, dwb2, addr2, req2, ack2,
              clr_seq, clr_id, clr_req, clr_a2,
              halt_req, halt_a2, roll_req, roll_a2, tri_addr, reset);

log_buf buf3 (seq2, chkbits2, dwb2, addr2, req2, ack2,
              seq3, chkbits3, dwb3, addr3, req3, ack3,
              clr_seq, clr_id, clr_req, clr_a3,
              halt_req, halt_a3, roll_req, roll_a3, tri_addr, reset);

log_buf buf4 (seq3, chkbits3, dwb3, addr3, req3, ack3,
              seq4, chkbits4, dwb4, addr4, req4, ack4,
              clr_seq, clr_id, clr_req, clr_a4,
              halt_req, halt_a4, roll_req, roll_a4, tri_addr, reset);

log_buf buf5 (seq4, chkbits4, dwb4, addr4, req4, ack4,
              seq5, chkbits5, dwb5, addr5, req5, ack5,
              clr_seq, clr_id, clr_req, clr_a5,
              halt_req, halt_a5, roll_req, roll_a5, tri_addr, reset);

log_buf buf6 (seq5, chkbits5, dwb5, addr5, req5, ack5,
              seq6, chkbits6, dwb6, addr6, req6, ack6,
              clr_seq, clr_id, clr_req, clr_a6,
              halt_req, halt_a6, roll_req, roll_a6, tri_addr, reset);

log_buf buf7 (seq6, chkbits6, dwb6, addr6, req6, ack6,
              seq7, chkbits7, dwb7, addr7, req7, ack7,
              clr_seq, clr_id, clr_req, clr_a7,
              halt_req, halt_a7, roll_req, roll_a7, tri_addr, reset);

log_buf buf8 (seq7, chkbits7, dwb7, addr7, req7, ack7,
              out_seq, out_chkbits, out_dwb, not_used, out_req, out_ack,
              clr_seq, clr_id, clr_req, clr_a8,
              halt_req, halt_a8, roll_req, roll_a8, tri_addr, reset);


always wait (reset)
  begin
    clr_ack = 0;
    halt_ack = 0;
    roll_ack = 0;
    wait (~reset);
  end

always @(clr_a1 or clr_a2 or clr_a3 or clr_a4 or clr_a5 or clr_a6 or
         clr_a7 or clr_a8)
  if (clr_a1 & clr_a2 & clr_a3 & clr_a4 & clr_a5 & clr_a6 & clr_a7 &
      clr_a8)
    clr_ack = 1;
  else if (~clr_a1 & ~clr_a2 & ~clr_a3 & ~clr_a4 & ~clr_a5 & ~clr_a6 &
           ~clr_a7 & ~clr_a8)
    clr_ack = 0;

always @(halt_a1 or halt_a2 or halt_a3 or halt_a4 or halt_a5 or halt_a6 or
         halt_a7 or halt_a8)
  if (halt_a1 & halt_a2 & halt_a3 & halt_a4 & halt_a5 & halt_a6 &
      halt_a7 & halt_a8)
    halt_ack = 1;
  else if (~halt_a1 & ~halt_a2 & ~halt_a3 & ~halt_a4 & ~halt_a5 &
           ~halt_a6 & ~halt_a7 & ~halt_a8)
    halt_ack = 0;

always @(roll_a1 or roll_a2 or roll_a3 or roll_a4 or roll_a5 or roll_a6 or
         roll_a7 or roll_a8)
  if (roll_a1 & roll_a2 & roll_a3 & roll_a4 & roll_a5 & roll_a6 &
      roll_a7 & roll_a8)
    roll_ack = 1;
  else if (~roll_a1 & ~roll_a2 & ~roll_a3 & ~roll_a4 & ~roll_a5 &
           ~roll_a6 & ~roll_a7 & ~roll_a8)
    roll_ack = 0;

endmodule // logq

// ====================================================================

module log_buf (in_s, in_k, in_w, in_d, in_r, in_a,
                out_s, out_k, out_w, out_d, out_r, out_a,
                clr_s, clr_i, clr_r, clr_a,
                halt_r, halt_a, roll_r, roll_a, tri_addr, reset);

`include "parameter"

input [SEQ_WIDTH-1:0] in_s, clr_s;
input [CHECKERS-1:0] in_k;
input [DWBS-1:0] in_w;
input [ADDR_WIDTH:0] in_d;
output [SEQ_WIDTH-1:0] out_s;
output [CHECKERS-1:0] out_k;
output [DWBS-1:0] out_w;
output [ADDR_WIDTH:0] out_d, tri_addr;
input [CHKID_WIDTH-1:0] clr_i;
input in_r, out_a, clr_r, halt_r, roll_r, reset;
output in_a, out_r, clr_a, halt_a, roll_a;

reg [SEQ_WIDTH-1:0] out_s;
reg [CHECKERS-1:0] out_k;
reg [DWBS-1:0] out_w;
reg [ADDR_WIDTH:0] out_d, tri_addr;
reg in_a, out_r, clr_a, halt_a, roll_a;

reg [SEQ_WIDTH-1:0] diff;
reg valid;

always wait (reset)
  begin
    disable clear_cycle;
    disable rollback_cycle;
    disable input_cycle;
    disable output_cycle;
    out_s = XXX;
    out_k = XXX;
    out_w = XXX;
    out_d = XXX;
    tri_addr = ZZZ;
    in_a = 0;
    out_r = 0;
    clr_a = 0;
    halt_a = 0;
    roll_a = 0;
    valid = 0;
    wait (~reset);
  end

always wait (clr_r & ~reset) // clear check bit
  begin :clear_cycle
    #1;
    clr_a = 1;
    wait (~clr_r);
    #1;
    if (out_s == clr_s)
      out_k [clr_i] = 0;
    #1;
    clr_a = 0;
  end

always wait (halt_r & ~reset)
  begin :rollback_cycle
    #1;
    halt_a = 1;
    wait (roll_r);
    #1;
    disable input_cycle;
    disable output_cycle;
    in_a = 0;
    out_r = 0;
    if (valid & (out_s == clr_s))
      tri_addr = out_d; // address to be rolled back

    #DLY_SEQ_COMP;
    diff = out_s - clr_s; // compare sequence numbers
    if (~diff [SEQ_WIDTH-1])
      begin
        valid = 0; // invalidate entry
        out_s = XXX;
        out_k = XXX;
        out_w = XXX;
        out_d = XXX;
      end

    roll_a = 1;
    fork
      begin
        wait (~roll_r);
        #1;
        roll_a = 0;
      end
      begin
        wait (~halt_r);
        #1;
        tri_addr = ZZZ; // put back tri-state
        halt_a = 0;
      end
    join
  end

always wait (in_r & ~valid & ~clr_r & ~halt_r & ~reset)
  begin :input_cycle
    #1;
    wait (~clr_r & ~clr_a & ~halt_r);
    out_s = in_s;
    out_k = in_k;
    out_w = in_w;
    out_d = in_d;
    in_a = 1;
    valid = 1;
    wait (~in_r);
    #1;
    wait (~clr_r & ~clr_a & ~halt_r);
    in_a = 0;
  end

always wait (valid & ~clr_r & ~halt_r & ~reset)
  begin :output_cycle
    #1;
    wait (~clr_r & ~clr_a & ~halt_r);
    out_r = 1;
    wait (out_a);
    #1;
    valid = 0;
    wait (~clr_r & ~clr_a & ~halt_r);
    out_r = 0;
    wait (~out_a);
  end

endmodule // log_buf
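The rollback cycle in log_buf, like those in the delayed write buffers that follow, decides which entries to discard by subtracting the sequence number of the faulty instruction and testing the sign bit of the difference. The sketch below isolates that test as a function so that it can be exercised on its own. It is illustrative only and is not part of the AMPIRE source: the module name seq_cmp_sketch, the function name squash, and the 4-bit SEQ_WIDTH are hypothetical, and the test gives the intended answer only if all live sequence numbers lie within 2^(SEQ_WIDTH-1) of one another, so that wrap-around cannot make an older entry appear younger.

```verilog
// Hypothetical sketch (not part of the thesis source): isolates the
// sequence-number test "diff = out_s - clr_s; if (~diff [SEQ_WIDTH-1])"
// used by the rollback cycles.
module seq_cmp_sketch;

parameter SEQ_WIDTH = 4; // illustrative width only

// Returns 1 if entry_seq is the faulty instruction or was issued after it,
// i.e. the entry must be invalidated on rollback. Assumes live sequence
// numbers never differ by 2**(SEQ_WIDTH-1) or more.
function squash;
  input [SEQ_WIDTH-1:0] entry_seq, err_seq;
  reg [SEQ_WIDTH-1:0] diff;
  begin
    diff = entry_seq - err_seq;   // wraps modulo 2**SEQ_WIDTH
    squash = ~diff [SEQ_WIDTH-1]; // sign bit clear: entry is not older
  end
endfunction

endmodule // seq_cmp_sketch
```

With a 4-bit sequence space, for example, an entry numbered 1 is squashed by an error at 14 (the counter has wrapped, so 1 is the younger instruction), while an entry numbered 14 survives an error at 1.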


memdwb.v

// Memory Delayed Write Buffer: Note initial wait bit = 1

module memdwb (in_seq, in_reg, in_addr, in_data, in_rw_mode, in_retry,
               in_req, in_ack,
               chk_req, chk_ack, val_seq, val_req, val_ack,
               out_addr, tri_data, rw_mode, out_retry, out_req, out_ack,
               out_seq, out_reg, tri_dq_req, dq_ack,
               halt_req, halt_ack, roll_req, roll_ack, reset);

`include "parameter"

input [SEQ_WIDTH-1:0] in_seq, val_seq;
input [REG_WIDTH-1:0] in_reg;
input [DATA_WIDTH:0] in_addr, in_data;
output [ADDR_WIDTH:0] out_addr;
output [DATA_WIDTH:0] tri_data;
output [SEQ_WIDTH-1:0] out_seq;
output [REG_WIDTH-1:0] out_reg;
input in_rw_mode, in_retry, reset;
output rw_mode, out_retry;
input in_req, chk_ack, val_req, out_ack, dq_ack, halt_req, roll_req;
output in_ack, chk_req, val_ack, out_req, tri_dq_req, halt_ack, roll_ack;

reg [ADDR_WIDTH:0] out_addr;
reg [SEQ_WIDTH-1:0] out_seq;
reg [REG_WIDTH-1:0] out_reg;
reg rw_mode, out_retry;
reg in_ack, chk_req, out_req, tri_dq_req, halt_ack, roll_ack;

reg [ADDR_WIDTH:0] rd_addr;
reg [DATA_WIDTH:0] out_data;
reg [SEQ_WIDTH-1:0] diff;
reg go_read, arbrd_req, arbrd_ack, arbwr_req, arbwr_ack, wrq_req,
    com_ack, rdq_req, haltc_ack, rollc_ack;

wire [ADDR_WIDTH:0] adpt_addr = in_addr; // "addr" adaptor
wire [ADDR_WIDTH:0] com_addr;
wire [DATA_WIDTH:0] com_data, tri_qout;

memdwbq memdwbq (in_seq, 1'b1, adpt_addr, in_data, wrq_req, wrq_ack,
                 com_wait, com_addr, com_data, com_req, com_ack,
                 val_seq, val_req, val_ack,
                 rd_addr, rdq_req, rdq_ack, match, tri_qout,
                 halt_req, haltq_ack, roll_req, rollq_ack, reset);

assign tri_data = rw_mode ? ZZZ : out_data;

always wait (reset)
  begin
    disable rollback_cycle;
    disable input_cycle;
    disable read_cycle;
    disable commit_cycle;
    disable arb_cycle;
    out_addr = XXX;
    out_seq = XXX;
    out_reg = XXX;
    rw_mode = XXX;
    out_retry = 0;
    in_ack = 0;
    chk_req = 0;
    out_req = 0;
    tri_dq_req = ZZZ;
    halt_ack = 0;
    roll_ack = 0;
    out_data = XXX;
    go_read = 0;
    arbrd_req = 0;
    arbrd_ack = 0;
    arbwr_req = 0;
    arbwr_ack = 0;
    wrq_req = 0;
    com_ack = 0;
    rdq_req = 0;
    haltc_ack = 0;
    rollc_ack = 0;
    wait (~reset);
  end

always @(haltq_ack or haltc_ack)
  if (haltq_ack & haltc_ack)
    halt_ack = 1;
  else if (~haltq_ack & ~haltc_ack)
    halt_ack = 0;

always @(rollq_ack or rollc_ack)
  if (rollq_ack & rollc_ack)
    roll_ack = 1;
  else if (~rollq_ack & ~rollc_ack)
    roll_ack = 0;

always wait (halt_req & ~reset)
  begin :rollback_cycle
    #1;
    haltc_ack = 1;
    wait (roll_req);
    #1;
    disable input_cycle;
    disable read_cycle;
    disable commit_cycle;
    disable arb_cycle;
    out_addr = XXX;
    rw_mode = XXX;
    out_retry = 0;
    in_ack = 0;
    chk_req = 0;
    out_req = 0;
    tri_dq_req = ZZZ;
    out_data = XXX;
    arbrd_req = 0;
    arbrd_ack = 0;
    arbwr_req = 0;
    arbwr_ack = 0;
    wrq_req = 0;
    com_ack = 0;
    rdq_req = 0;

    #DLY_SEQ_COMP;
    diff = out_seq - val_seq; // compare sequence numbers
    if (~diff [SEQ_WIDTH-1])
      begin
        go_read = 0;
        out_seq = XXX;
        out_reg = XXX;
      end

    rollc_ack = 1;
    fork
      begin
        wait (~roll_req);
        #1;
        rollc_ack = 0;
      end
      begin
        wait (~halt_req);
        #1;
        haltc_ack = 0;
      end
    join
  end

always wait (in_req & ~halt_req & ~reset)
  begin :input_cycle
    #1;
    if (in_rw_mode)
      begin
        wait (~go_read & ~halt_req);
        out_seq = in_seq;
        out_reg = in_reg;
        rd_addr = in_addr;
        out_retry = in_retry;
        #1;
        wait (~halt_req);
        go_read = 1; // start read cycle
        chk_req = 1; // notify checker
        wait (chk_ack);
        #1;
        fork // finish up transactions
          begin
            wait (~halt_req);
            chk_req = 0;
            wait (~chk_ack);
          end
          begin
            wait (~halt_req);
            in_ack = 1;
            wait (~in_req);
            #1;
            wait (~halt_req);
            in_ack = 0;
          end
        join
      end
    else
      begin
        wrq_req = 1; // write to queue
        wait (wrq_ack); // wait until actually accepted
        #1;
        chk_req = 1; // notify checker
        wait (chk_ack);
        #1;
        fork // finish up transactions
          begin
            wrq_req = 0;
            chk_req = 0;
            wait (~wrq_ack & ~chk_ack);
          end
          begin
            in_ack = 1;
            wait (~in_req);
            #1;
            in_ack = 0;
          end
        join
      end
  end

always wait (go_read & ~halt_req & ~reset)
  begin :read_cycle
    #1;
    rdq_req = 1; // search queue first
    wait (rdq_ack);
    #1;
    arbrd_req = 1; // request memory cycle
    wait (arbrd_ack);
    #1;
    if (match)
      begin // data still in queue
        out_data = tri_qout;
        rdq_req = 0; // release queue
        rw_mode = 0; // write mode to access bus
        #1;
        wait (~halt_req);
        tri_dq_req = 1; // write DQ directly
        wait (dq_ack);
        #1;
        go_read = 0;
        wait (~halt_req);
        tri_dq_req = ZZZ; // tri-state it
        arbrd_req = 0;
        wait (~dq_ack & ~arbrd_ack & ~rdq_ack);
      end
    else
      begin
        rdq_req = 0; // release queue
        out_addr = rd_addr;
        rw_mode = 1; // read mode
        #1;
        wait (~halt_req);
        out_req = 1;
        wait (out_ack);
        #1;
        go_read = 0;
        wait (~halt_req);
        out_req = 0;
        arbrd_req = 0;
        wait (~out_ack & ~arbrd_ack & ~rdq_ack);
      end
  end

always wait (com_req & ~halt_req & ~reset)
  begin :commit_cycle
    wait (~com_wait); // wait until validated
    #1;
    arbwr_req = 1; // request memory cycle
    wait (arbwr_ack);
    #1;
    out_addr = com_addr;
    out_data = com_data;
    rw_mode = 0; // write mode
    #1;
    wait (~halt_req);
    out_req = 1; // write to data memory
    wait (out_ack);
    #1;
    wait (~halt_req);
    com_ack = 1; // delete from queue
    out_req = 0;
    arbwr_req = 0;
    wait (~com_req);
    #1;
    wait (~halt_req);
    com_ack = 0;
    wait (~out_ack & ~arbwr_ack);
  end

always wait ((arbrd_req | arbwr_req) & ~halt_req & ~reset)
  begin :arb_cycle
    #1;
    if (arbrd_req) // read has priority over write
      begin
        arbrd_ack = 1;
        wait (~arbrd_req);
        #1;
        arbrd_ack = 0;
      end
    else
      begin
        arbwr_ack = 1;
        wait (~arbwr_req);
        #1;
        arbwr_ack = 0;
      end
  end

endmodule // memdwb

// ====================================================================
// MEM_DWB Queue: Note that sequence number is not used in the output.

module memdwbq (in_seq, in_wait, in_addr, in_data, in_req, in_ack,
                out_wait, out_addr, out_data, out_req, out_ack,
                val_seq, val_req, val_ack,
                read_addr, read_req, read_ack, match, tri_out,
                halt_req, halt_ack, roll_req, roll_ack, reset);

`include "parameter"

input [SEQ_WIDTH-1:0] in_seq, val_seq;
input [ADDR_WIDTH:0] in_addr, read_addr;
input [DATA_WIDTH:0] in_data;
output [ADDR_WIDTH:0] out_addr;
output [DATA_WIDTH:0] out_data, tri_out;
input in_wait, reset;
output out_wait, match;
input in_req, out_ack, val_req, read_req, halt_req, roll_req;
output in_ack, out_req, val_ack, read_ack, halt_ack, roll_ack;

reg val_ack, read_ack, halt_ack, roll_ack;

reg read_ready, read_en1, read_en2, read_en3, read_en4, haltr_ack, rollr_ack;

wire [SEQ_WIDTH-1:0] seq1, seq2, seq3, not_used;
wire [ADDR_WIDTH:0] addr1, addr2, addr3;
wire [DATA_WIDTH:0] data1, data2, data3;
wire match1, match2, match3, match4;

assign match = match1 | match2 | match3 | match4;


memdwb_buf buf1 (in_seq, in_wait, in_addr, in_data, in_req, in_ack,
                 seq1, wait1, addr1, data1, req1, ack1,
                 val_seq, val_req, val_ack1,
                 read_addr, read_req, read_ack1, match1, read_en1, tri_out,
                 halt_req, halt_a1, roll_req, roll_a1, reset);

memdwb_buf buf2 (seq1, wait1, addr1, data1, req1, ack1,
                 seq2, wait2, addr2, data2, req2, ack2,
                 val_seq, val_req, val_ack2,
                 read_addr, read_req, read_ack2, match2, read_en2, tri_out,
                 halt_req, halt_a2, roll_req, roll_a2, reset);

memdwb_buf buf3 (seq2, wait2, addr2, data2, req2, ack2,
                 seq3, wait3, addr3, data3, req3, ack3,
                 val_seq, val_req, val_ack3,
                 read_addr, read_req, read_ack3, match3, read_en3, tri_out,
                 halt_req, halt_a3, roll_req, roll_a3, reset);

memdwb_buf buf4 (seq3, wait3, addr3, data3, req3, ack3,
                 not_used, out_wait, out_addr, out_data, out_req, out_ack,
                 val_seq, val_req, val_ack4,
                 read_addr, read_req, read_ack4, match4, read_en4, tri_out,
                 halt_req, halt_a4, roll_req, roll_a4, reset);


always wait (reset)
  begin
    disable rollback_cycle;
    disable read_cycle;
    val_ack = 0;
    read_ack = 0;
    halt_ack = 0;
    roll_ack = 0;
    read_ready = 0;
    read_en1 = 0;
    read_en2 = 0;
    read_en3 = 0;
    read_en4 = 0;
    haltr_ack = 0;
    rollr_ack = 0;
    wait (~reset);
  end

always @(val_ack1 or val_ack2 or val_ack3 or val_ack4)
  if (val_ack1 & val_ack2 & val_ack3 & val_ack4)
    val_ack = 1;
  else if (~val_ack1 & ~val_ack2 & ~val_ack3 & ~val_ack4)
    val_ack = 0;

always @(read_ack1 or read_ack2 or read_ack3 or read_ack4)
  if (read_ack1 & read_ack2 & read_ack3 & read_ack4)
    read_ready = 1;
  else if (~read_ack1 & ~read_ack2 & ~read_ack3 & ~read_ack4)
    read_ready = 0;

always @(halt_a1 or halt_a2 or halt_a3 or halt_a4 or haltr_ack)
  if (halt_a1 & halt_a2 & halt_a3 & halt_a4 & haltr_ack)
    halt_ack = 1;
  else if (~halt_a1 & ~halt_a2 & ~halt_a3 & ~halt_a4 & ~haltr_ack)
    halt_ack = 0;

always @(roll_a1 or roll_a2 or roll_a3 or roll_a4 or rollr_ack)
  if (roll_a1 & roll_a2 & roll_a3 & roll_a4 & rollr_ack)
    roll_ack = 1;
  else if (~roll_a1 & ~roll_a2 & ~roll_a3 & ~roll_a4 & ~rollr_ack)
    roll_ack = 0;

always wait (halt_req & ~reset)
  begin :rollback_cycle
    #1;
    haltr_ack = 1;
    wait (roll_req);
    #1;
    disable read_cycle;
    read_ready = 0;
    read_ack = 0;
    read_en1 = 0;
    read_en2 = 0;
    read_en3 = 0;
    read_en4 = 0;

    rollr_ack = 1;
    fork
      begin
        wait (~roll_req);
        #1;
        rollr_ack = 0;
      end
      begin
        wait (~halt_req);
        #1;
        haltr_ack = 0;
      end
    join
  end

always wait (read_ready & ~halt_req & ~reset)
  begin :read_cycle
    #1;
    casex ({match1, match2, match3, match4}) // priority decoder
      4'b1???: read_en1 = 1;
      4'b01??: read_en2 = 1;
      4'b001?: read_en3 = 1;
      4'b0001: read_en4 = 1;
    endcase
    #1;
    read_ack = 1;
    wait (~read_ready);
    #1;
    read_en1 = 0;
    read_en2 = 0;
    read_en3 = 0;
    read_en4 = 0;
    read_ack = 0;
  end

endmodule // memdwbq

// ====================================================================

module memdwb_buf (in_seq, in_wait, in_addr, in_data, in_req, in_ack,
                   out_seq, out_wait, out_addr, out_data, out_req, out_ack,
                   val_seq, val_req, val_ack,
                   read_addr, read_req, read_ack, match, read_en, tri_out,
                   halt_req, halt_ack, roll_req, roll_ack, reset);

`include "parameter"

input [SEQ_WIDTH-1:0] in_seq, val_seq;
input [ADDR_WIDTH:0] in_addr, read_addr;
input [DATA_WIDTH:0] in_data;
output [SEQ_WIDTH-1:0] out_seq;
output [ADDR_WIDTH:0] out_addr;
output [DATA_WIDTH:0] out_data, tri_out;
input in_wait, read_en, reset;
output out_wait, match;
input in_req, out_ack, val_req, read_req, halt_req, roll_req;
output in_ack, out_req, val_ack, read_ack, halt_ack, roll_ack;

reg [SEQ_WIDTH-1:0] out_seq;
reg [ADDR_WIDTH:0] out_addr;
reg [DATA_WIDTH:0] out_data;
reg out_wait, match;
reg in_ack, out_req, val_ack, read_ack, halt_ack, roll_ack;

reg [SEQ_WIDTH-1:0] diff;
reg valid;

assign tri_out = read_en ? out_data : ZZZ;

always wait (reset)
  begin
    disable validate_cycle;
    disable rollback_cycle;
    disable find_cycle;
    disable input_cycle;
    disable output_cycle;
    out_seq = XXX;
    out_addr = XXX;
    out_data = XXX;
    out_wait = XXX;
    match = 0;
    in_ack = 0;
    out_req = 0;
    val_ack = 0;
    read_ack = 0;
    halt_ack = 0;
    roll_ack = 0;
    valid = 0;
    wait (~reset);
  end

always wait (val_req & ~reset) // not same time as rollback
  begin :validate_cycle // clear "wait" bit
    #1;
    val_ack = 1;
    wait (~val_req);
    #1;
    if (out_seq == val_seq)
      out_wait = 0;
    #1;
    val_ack = 0;
  end

always wait (halt_req & ~reset)
  begin :rollback_cycle
    #1;
    halt_ack = 1;
    wait (roll_req);
    #1;
    disable find_cycle;
    disable input_cycle;
    disable output_cycle;
    match = 0;
    in_ack = 0;
    out_req = 0;
    read_ack = 0;

    #DLY_SEQ_COMP;
    diff = out_seq - val_seq; // compare sequence numbers
    if (~diff [SEQ_WIDTH-1])
      begin
        valid = 0;
        out_seq = XXX;
        out_addr = XXX;
        out_data = XXX;
        out_wait = XXX;
      end

    roll_ack = 1;
    fork
      begin
        wait (~roll_req);
        #1;
        roll_ack = 0;
      end
      begin
        wait (~halt_req);
        #1;
        halt_ack = 0;
      end
    join
  end

always wait (read_req & ~halt_req & ~reset) // find uncommitted data
  begin :find_cycle
    #1;
    match = valid & (out_addr == read_addr);
    #1;
    read_ack = 1;
    wait (~read_req);
    #1;
    match = 0;
    read_ack = 0;
  end

always wait (in_req & ~valid & ~val_req & ~read_req & ~halt_req & ~reset)
  begin :input_cycle
    #1;
    wait (~val_req & ~val_ack & ~read_req & ~halt_req);
    out_seq = in_seq;
    out_wait = in_wait;
    out_addr = in_addr;
    out_data = in_data;
    in_ack = 1;
    valid = 1;
    wait (~in_req);
    #1;
    wait (~val_req & ~val_ack & ~read_req & ~halt_req);
    in_ack = 0;
  end

always wait (valid & ~val_req & ~read_req & ~halt_req & ~reset)
  begin :output_cycle
    #1;
    wait (~val_req & ~val_ack & ~read_req & ~halt_req);
    out_req = 1;
    wait (out_ack);
    #1;
    valid = 0;
    wait (~val_req & ~val_ack & ~read_req & ~halt_req);
    out_req = 0;
    wait (~out_ack);
  end

endmodule // memdwb_buf
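Several modules in these listings (logq, memdwb, memdwbq and regdwb, among others) combine a group of acknowledge signals with the same always block: the combined acknowledge is set only when every input is high, cleared only when every input is low, and otherwise left unchanged. This is the behaviour of a Muller C-element. A minimal parameterized sketch of the pattern follows, written with the reduction operators & and ~| in place of the spelled-out product terms. The module name ack_join_sketch and the parameter N are hypothetical and do not appear in the thesis source; the reset handling imitates the "always wait (reset)" blocks used throughout the listings.

```verilog
// Hypothetical sketch (not part of the thesis source): N-input acknowledge
// join with Muller C-element behaviour, as in the halt/roll/clr joins above.
module ack_join_sketch (acks, ack, reset);

parameter N = 8; // number of acknowledges joined (e.g. 8 in logq)

input [N-1:0] acks;
input reset;
output ack;

reg ack;

always wait (reset)
  begin
    ack = 0;
    wait (~reset);
  end

// Rise when all inputs are high, fall when all are low;
// for any other mix of inputs, ack keeps its previous value.
always @(acks)
  if (&acks)        // every stage has acknowledged
    ack = 1;
  else if (~|acks)  // every stage has withdrawn its acknowledge
    ack = 0;

endmodule // ack_join_sketch
```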


pc.v

// Program Counter

module pc (pc_load, in_retry, load_req, load_ack,
           pc_out, out_retry, out_req, out_ack, reset);

`include "parameter"

input [ADDR_WIDTH:0] pc_load;
output [ADDR_WIDTH:0] pc_out;
input in_retry, load_req, out_ack, reset;
output out_retry, load_ack, out_req;

reg [ADDR_WIDTH:0] pc_out;
reg out_retry, load_ack, out_req;

reg [5:0] loop; // ADDR_WIDTH = 32 bits max
reg parity;

always wait (reset)
  begin
    disable increment_cycle;
    disable load_cycle;
    out_retry = 0;
    load_ack = 0;
    out_req = 0;
    pc_out = 0;
    wait (~reset);
    out_req = 1;
  end

always wait (load_req & ~reset)
  begin :load_cycle
    #1;
    disable increment_cycle;
    out_req = 0;
    pc_out = {pc_load[ADDR_WIDTH:ADDR_IGNORE], {ADDR_IGNORE{1'b0}}};
    out_retry = in_retry; // rollback flag
    load_ack = 1;
    wait (~load_req);
    #1;
    load_ack = 0;
    out_req = 1;
  end

always wait (out_ack & ~load_req & ~reset)
  begin :increment_cycle
    #1;
    out_req = 0;
    #DLY_PC_INC;
    pc_out = pc_out + ADDR_INC;
    out_retry = 0; // reset rollback flag

    parity = 0;
    for (loop=0; loop<ADDR_WIDTH; loop=loop+1) // calculate parity
      parity = parity ^ pc_out [loop];
    pc_out [ADDR_WIDTH] = parity;

    wait (~out_ack);
    #1;
    out_req = 1;
  end

endmodule // pc
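The increment cycle in pc recomputes the parity bit with a for loop that XORs address bits 0 through ADDR_WIDTH-1 into bit ADDR_WIDTH, so that after each increment the full ADDR_WIDTH+1-bit value holds an even number of 1s. The unary reduction XOR operator computes the same bit in a single expression, as the following sketch shows. The module name parity_sketch, the function name with_parity, and the 8-bit width are illustrative assumptions rather than part of the thesis code.

```verilog
// Hypothetical sketch (not part of the thesis source): the parity loop of
// module pc expressed with the reduction XOR operator.
module parity_sketch;

parameter ADDR_WIDTH = 8; // illustrative width only

// Returns {parity, addr}, where parity = addr[0] ^ addr[1] ^ ...,
// so the complete ADDR_WIDTH+1-bit word has an even number of 1s.
function [ADDR_WIDTH:0] with_parity;
  input [ADDR_WIDTH-1:0] addr;
  begin
    with_parity = {^addr, addr};
  end
endfunction

endmodule // parity_sketch
```

Inside module pc the corresponding statement would be pc_out [ADDR_WIDTH] = ^pc_out [ADDR_WIDTH-1:0];, which would remove the need for the loop counter in that block.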


regdwb.v

1: // Delayed Write Buffer for Register File: Note initial wait bit = 12:3: module regdwb (in_seq, in_reg, in_data, in_req, in_ack,4: out_seq, out_reg, out_data, chk_req, chk_ack, clr_req, clr_ack,5: val_seq, val_req, val_ack,6: rd_reg1, rd_reg2, sim_f, tri_data1, tri_data2, rd_req, rd_ack,7: rf_data1, rf_data2, rdf_req, rdf_ack,8: wrf_reg, wrf_data, wrf_req, wrf_ack,9: halt_req, halt_ack, roll_req, roll_ack, reset);10:11: ‘include "parameter"12:13: input [SEQ_WIDTH-1:0] in_seq, val_seq;14: input [REG_WIDTH-1:0] in_reg, rd_reg1, rd_reg2;15: input [DATA_WIDTH:0] in_data, rf_data1, rf_data2;16: output [SEQ_WIDTH-1:0] out_seq;17: output [REG_WIDTH-1:0] out_reg, wrf_reg;18: output [DATA_WIDTH:0] out_data, tri_data1, tri_data2, wrf_data;19: input sim_f, reset;20: input in_req, chk_ack, clr_ack, val_req, rd_req, rdf_ack, wrf_ack,21: halt_req, roll_req;22: output in_ack, chk_req, clr_req, val_ack, rd_ack, rdf_req, wrf_req,23: halt_ack, roll_ack;24:25: reg [SEQ_WIDTH-1:0] out_seq;26: reg [REG_WIDTH-1:0] out_reg, wrf_reg;27: reg [DATA_WIDTH:0] out_data, wrf_data;28: reg in_ack, chk_req, clr_req, rd_ack, wrf_req, halt_ack, roll_ack;29:30: reg [DATA_WIDTH:0] out_data1, out_data2;31: reg [SEQ_WIDTH-1:0] diff;32: reg go_check, go_clear, inq_req, com_ack, haltc_ack, rollc_ack;33:34: tri [DATA_WIDTH:0] tri_out1, tri_out2;35: wire [REG_WIDTH-1:0] com_reg;36: wire [DATA_WIDTH:0] com_data;37: wire rdf_req = rd_req; // route read request38:39: regdwbq regdwbq (in_seq, 1’b1, in_reg, in_data, inq_req, inq_ack,40: com_wait, com_reg, com_data, com_req, com_ack,41: val_seq, val_req, val_ack,42: rd_reg1, rd_reg2, rd_req, rdq_ack,43: match1, match2, tri_out1, tri_out2,44: halt_req, haltq_ack, roll_req, rollq_ack, reset);45:46: assign tri_data1 = rd_req ? out_data1 : ZZZ;47: assign tri_data2 = rd_req ? 
out_data2 : ZZZ;48:49: always wait (reset)50: begin51: disable rollback_cycle;52: disable input_cycle;53: disable check_cycle;54: disable clear_cycle;55: disable commit_cycle;56: disable read_cycle;57: out_seq = XXX;58: out_reg = XXX;59: wrf_reg = XXX;60: out_data = XXX;61: wrf_data = XXX;62: in_ack = 0;63: chk_req = 0;64: clr_req = 0;65: rd_ack = 0;66: wrf_req = 0;67: halt_ack = 0;68: roll_ack = 0;69: out_data1 = XXX;70: out_data2 = XXX;71: go_check = 0;

184

regdwb.v

72: go_clear = 0;73: inq_req = 0;74: com_ack = 0;75: haltc_ack = 0;76: rollc_ack = 0;77: wait (˜reset);78: end79:80: always @(haltq_ack or haltc_ack)81: if (haltq_ack & haltc_ack)82: halt_ack = 1;83: else if (˜haltq_ack & ˜haltc_ack)84: halt_ack = 0;85:86: always @(rollq_ack or rollc_ack)87: if (rollq_ack & rollc_ack)88: roll_ack = 1;89: else if (˜rollq_ack & ˜rollc_ack)90: roll_ack = 0;91:92: always wait (halt_req & ˜reset)93: begin :rollback_cycle94: #1;95: haltc_ack = 1;96: wait (roll_req);97: #1;98: disable input_cycle;99: disable check_cycle;100: disable clear_cycle;101: disable commit_cycle;102: disable read_cycle;103: wrf_reg = XXX;104: wrf_data = XXX;105: in_ack = 0;106: chk_req = 0;107: clr_req = 0;108: rd_ack = 0;109: wrf_req = 0;110: out_data1 = XXX;111: out_data2 = XXX;112: inq_req = 0;113: com_ack = 0;114:115: #DLY_SEQ_COMP;116: diff = out_seq - val_seq; // compare sequence numbers117: if (˜diff [SEQ_WIDTH-1])118: begin119: go_check = 0;120: go_clear = 0;121: out_seq = XXX;122: out_reg = XXX;123: out_data = XXX;124: end125:126: rollc_ack = 1;127: fork128: begin129: wait (˜roll_req);130: #1;131: rollc_ack = 0;132: end133: begin134: wait (˜halt_req);135: #1;136: haltc_ack = 0;137: end138: join139: end140:141: always wait (in_req & ˜go_check & ˜go_clear & ˜halt_req & ˜reset)142: begin :input_cycle

185

regdwb.v

143: #1;144: wait (˜halt_req);145: out_seq = in_seq;146: out_reg = in_reg;147: out_data = in_data;148: inq_req = 1;149: wait (inq_ack); // wait until actually accepted150: #1;151: in_ack = 1;152: go_check = 1;153: go_clear = 1;154: wait (˜in_req);155: #1;156: wait (˜halt_req);157: inq_req = 0;158: wait (˜inq_ack);159: #1;160: in_ack = 0;161: end162:163: always wait (go_check & ˜halt_req & ˜reset)164: begin :check_cycle165: #1;166: wait (˜halt_req);167: chk_req = 1;168: wait (chk_ack);169: #1;170: go_check = 0;171: wait (˜halt_req);172: chk_req = 0;173: wait (˜chk_ack);174: end175:176: always wait (go_clear & ˜halt_req & ˜reset)177: begin :clear_cycle178: #1;179: wait (˜halt_req);180: clr_req = 1;181: wait (clr_ack);182: #1;183: go_clear = 0;184: wait (˜halt_req);185: clr_req = 0;186: wait (˜clr_ack);187: end188:189: always wait (com_req & ˜halt_req & ˜reset)190: begin :commit_cycle191: wait (˜com_wait); // wait until validated192: #1;193: wrf_reg = com_reg;194: wrf_data = com_data;195: #1;196: wait (˜halt_req);197: wrf_req = 1; // write to register file198: wait (wrf_ack);199: #1;200: wait (˜rd_req & ˜halt_req);201: com_ack = 1; // delete from queue202: wrf_req = 0;203: wait (˜com_req);204: #1;205: wait (˜rd_req & ˜halt_req);206: com_ack = 0;207: wait (˜wrf_ack);208: end209:210: always wait (rdq_ack & rdf_ack & ˜halt_req & ˜reset)211: begin :read_cycle212: #1;213: out_data1 = match1 ? tri_out1 : rf_data1;

186

regdwb.v

    out_data2 = match2 ? tri_out2 : rf_data2;
    if (sim_f)                                  // simulate fault
      begin
        out_data1 [DATA_WIDTH] = ~out_data1 [DATA_WIDTH];
        out_data2 [DATA_WIDTH] = ~out_data2 [DATA_WIDTH];
      end
    #1;
    rd_ack = 1;
    wait (~rdq_ack & ~rdf_ack);
    #1;
    rd_ack = 0;
  end

endmodule // regdwb

// ====================================================================
// REG_DWB Queue: Note that sequence number is not used in the output.

module regdwbq (in_seq, in_wait, in_reg, in_data, in_req, in_ack,
                out_wait, out_reg, out_data, out_req, out_ack,
                val_seq, val_req, val_ack,
                read_reg1, read_reg2, read_req, read_ack,
                match1, match2, tri_out1, tri_out2,
                halt_req, halt_ack, roll_req, roll_ack, reset);

`include "parameter"

input [SEQ_WIDTH-1:0] in_seq, val_seq;
input [REG_WIDTH-1:0] in_reg, read_reg1, read_reg2;
input [DATA_WIDTH:0] in_data;
output [REG_WIDTH-1:0] out_reg;
output [DATA_WIDTH:0] out_data, tri_out1, tri_out2;
input in_wait, reset;
output out_wait, match1, match2;
input in_req, out_ack, val_req, read_req, halt_req, roll_req;
output in_ack, out_req, val_ack, read_ack, halt_ack, roll_ack;

reg val_ack, read_ack, halt_ack, roll_ack;

reg read_ready, read_en11, read_en21, read_en31, read_en41, read_en12,
    read_en22, read_en32, read_en42, haltr_ack, rollr_ack;

wire [SEQ_WIDTH-1:0] seq1, seq2, seq3, not_used;
wire [REG_WIDTH-1:0] reg1, reg2, reg3;
wire [DATA_WIDTH:0] data1, data2, data3;
wire match11, match21, match31, match41, match12, match22, match32, match42;

assign match1 = match11 | match21 | match31 | match41;
assign match2 = match12 | match22 | match32 | match42;

regdwb_buf buf1 (in_seq, in_wait, in_reg, in_data, in_req, in_ack,
                 seq1, wait1, reg1, data1, req1, ack1,
                 val_seq, val_req, val_ack1,
                 read_reg1, read_reg2, read_req, read_ack1,
                 match11, match12, read_en11, read_en12, tri_out1, tri_out2,
                 halt_req, halt_a1, roll_req, roll_a1, reset);

regdwb_buf buf2 (seq1, wait1, reg1, data1, req1, ack1,
                 seq2, wait2, reg2, data2, req2, ack2,
                 val_seq, val_req, val_ack2,
                 read_reg1, read_reg2, read_req, read_ack2,
                 match21, match22, read_en21, read_en22, tri_out1, tri_out2,
                 halt_req, halt_a2, roll_req, roll_a2, reset);

regdwb_buf buf3 (seq2, wait2, reg2, data2, req2, ack2,
                 seq3, wait3, reg3, data3, req3, ack3,
                 val_seq, val_req, val_ack3,
                 read_reg1, read_reg2, read_req, read_ack3,
                 match31, match32, read_en31, read_en32, tri_out1, tri_out2,
                 halt_req, halt_a3, roll_req, roll_a3, reset);

regdwb_buf buf4 (seq3, wait3, reg3, data3, req3, ack3,
                 not_used, out_wait, out_reg, out_data, out_req, out_ack,
                 val_seq, val_req, val_ack4,
                 read_reg1, read_reg2, read_req, read_ack4,
                 match41, match42, read_en41, read_en42, tri_out1, tri_out2,
                 halt_req, halt_a4, roll_req, roll_a4, reset);

always wait (reset)
  begin
    disable read_cycle;
    val_ack = 0;
    read_ack = 0;
    read_ready = 0;
    halt_ack = 0;
    roll_ack = 0;
    read_en11 = 0;
    read_en21 = 0;
    read_en31 = 0;
    read_en41 = 0;
    read_en12 = 0;
    read_en22 = 0;
    read_en32 = 0;
    read_en42 = 0;
    haltr_ack = 0;
    rollr_ack = 0;
    wait (~reset);
  end

always @(val_ack1 or val_ack2 or val_ack3 or val_ack4)
  if (val_ack1 & val_ack2 & val_ack3 & val_ack4)
    val_ack = 1;
  else if (~val_ack1 & ~val_ack2 & ~val_ack3 & ~val_ack4)
    val_ack = 0;

always @(read_ack1 or read_ack2 or read_ack3 or read_ack4)
  if (read_ack1 & read_ack2 & read_ack3 & read_ack4)
    read_ready = 1;
  else if (~read_ack1 & ~read_ack2 & ~read_ack3 & ~read_ack4)
    read_ready = 0;

always @(halt_a1 or halt_a2 or halt_a3 or halt_a4 or haltr_ack)
  if (halt_a1 & halt_a2 & halt_a3 & halt_a4 & haltr_ack)
    halt_ack = 1;
  else if (~halt_a1 & ~halt_a2 & ~halt_a3 & ~halt_a4 & ~haltr_ack)
    halt_ack = 0;

always @(roll_a1 or roll_a2 or roll_a3 or roll_a4 or rollr_ack)
  if (roll_a1 & roll_a2 & roll_a3 & roll_a4 & rollr_ack)
    roll_ack = 1;
  else if (~roll_a1 & ~roll_a2 & ~roll_a3 & ~roll_a4 & ~rollr_ack)
    roll_ack = 0;

always wait (halt_req & ~reset)
  begin :rollback_cycle
    #1;
    haltr_ack = 1;
    wait (roll_req);
    #1;
    disable read_cycle;
    read_ready = 0;
    read_ack = 0;
    read_en11 = 0;
    read_en21 = 0;
    read_en31 = 0;
    read_en41 = 0;
    read_en12 = 0;
    read_en22 = 0;
    read_en32 = 0;
    read_en42 = 0;

    rollr_ack = 1;
    fork
      begin
        wait (~roll_req);
        #1;
        rollr_ack = 0;
      end
      begin
        wait (~halt_req);
        #1;
        haltr_ack = 0;
      end
    join
  end

always wait (read_ready & ~halt_req & ~reset)
  begin :read_cycle
    #1;
    casex ({match11, match21, match31, match41})     // priority decoder
      4'b1???: read_en11 = 1;
      4'b01??: read_en21 = 1;
      4'b001?: read_en31 = 1;
      4'b0001: read_en41 = 1;
    endcase
    casex ({match12, match22, match32, match42})
      4'b1???: read_en12 = 1;
      4'b01??: read_en22 = 1;
      4'b001?: read_en32 = 1;
      4'b0001: read_en42 = 1;
    endcase
    #1;
    read_ack = 1;
    wait (~read_ready);
    #1;
    read_en11 = 0;
    read_en21 = 0;
    read_en31 = 0;
    read_en41 = 0;
    read_en12 = 0;
    read_en22 = 0;
    read_en32 = 0;
    read_en42 = 0;
    read_ack = 0;
  end

endmodule // regdwbq

// ====================================================================

module regdwb_buf (in_seq, in_wait, in_reg, in_data, in_req, in_ack,
                   out_seq, out_wait, out_reg, out_data, out_req, out_ack,
                   val_seq, val_req, val_ack,
                   read_reg1, read_reg2, read_req, read_ack,
                   match1, match2, read_en1, read_en2, tri_out1, tri_out2,
                   halt_req, halt_ack, roll_req, roll_ack, reset);

`include "parameter"

input [SEQ_WIDTH-1:0] in_seq, val_seq;
input [REG_WIDTH-1:0] in_reg, read_reg1, read_reg2;
input [DATA_WIDTH:0] in_data;
output [SEQ_WIDTH-1:0] out_seq;
output [REG_WIDTH-1:0] out_reg;
output [DATA_WIDTH:0] out_data, tri_out1, tri_out2;
input in_wait, read_en1, read_en2, reset;
output out_wait, match1, match2;
input in_req, out_ack, val_req, read_req, halt_req, roll_req;
output in_ack, out_req, val_ack, read_ack, halt_ack, roll_ack;

reg [SEQ_WIDTH-1:0] out_seq;
reg [REG_WIDTH-1:0] out_reg;
reg [DATA_WIDTH:0] out_data;
reg out_wait, match1, match2;
reg in_ack, out_req, val_ack, read_ack, halt_ack, roll_ack;

reg [SEQ_WIDTH-1:0] diff;
reg valid;

assign tri_out1 = read_en1 ? out_data : ZZZ;
assign tri_out2 = read_en2 ? out_data : ZZZ;

always wait (reset)
  begin
    disable validate_cycle;
    disable rollback_cycle;
    disable find_cycle;
    disable input_cycle;
    disable output_cycle;
    out_seq = XXX;
    out_reg = XXX;
    out_data = XXX;
    out_wait = XXX;
    match1 = 0;
    match2 = 0;
    in_ack = 0;
    out_req = 0;
    val_ack = 0;
    read_ack = 0;
    halt_ack = 0;
    roll_ack = 0;
    valid = 0;
    wait (~reset);
  end

always wait (val_req & ~reset)                  // not same time as rollback
  begin :validate_cycle                         // clear "wait" bit
    #1;
    val_ack = 1;
    wait (~val_req);
    #1;
    if (out_seq == val_seq)
      out_wait = 0;
    #1;
    val_ack = 0;
  end

always wait (halt_req & ~reset)
  begin :rollback_cycle
    #1;
    halt_ack = 1;
    wait (roll_req);
    #1;
    disable find_cycle;
    disable input_cycle;
    disable output_cycle;
    match1 = 0;
    match2 = 0;
    in_ack = 0;
    out_req = 0;
    read_ack = 0;

    #DLY_SEQ_COMP;
    diff = out_seq - val_seq;                   // compare sequence numbers
    if (~diff [SEQ_WIDTH-1])
      begin
        valid = 0;
        out_seq = XXX;
        out_reg = XXX;
        out_data = XXX;
        out_wait = XXX;
      end

    roll_ack = 1;
    fork
      begin
        wait (~roll_req);
        #1;
        roll_ack = 0;
      end
      begin
        wait (~halt_req);
        #1;
        halt_ack = 0;
      end
    join
  end

always wait (read_req & ~halt_req & ~reset)     // find uncommitted data
  begin :find_cycle
    #1;                                         // do NOT match R0 !!!
    match1 = valid & (out_reg == read_reg1) & (read_reg1 != 0);
    match2 = valid & (out_reg == read_reg2) & (read_reg2 != 0);
    #1;
    read_ack = 1;
    wait (~read_req);
    #1;
    match1 = 0;
    match2 = 0;
    read_ack = 0;
  end

always wait (in_req & ~valid & ~val_req & ~read_req & ~halt_req & ~reset)
  begin :input_cycle
    #1;
    wait (~val_req & ~val_ack & ~read_req & ~halt_req);
    out_seq = in_seq;
    out_wait = in_wait;
    out_reg = in_reg;
    out_data = in_data;
    in_ack = 1;
    valid = 1;
    wait (~in_req);
    #1;
    wait (~val_req & ~val_ack & ~read_req & ~halt_req);
    in_ack = 0;
  end

always wait (valid & ~val_req & ~read_req & ~halt_req & ~reset)
  begin :output_cycle
    #1;
    wait (~val_req & ~val_ack & ~read_req & ~halt_req);
    out_req = 1;
    wait (out_ack);
    #1;
    valid = 0;
    wait (~val_req & ~val_ack & ~read_req & ~halt_req);
    out_req = 0;
    wait (~out_ack);
  end

endmodule // regdwb_buf
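In regdwbq, the four `always @(...)` blocks that merge the per-buffer acknowledgments into `val_ack`, `read_ready`, `halt_ack`, and `roll_ack` follow the rule of a Muller C-element: the output is set only when every input is high, cleared only when every input is low, and otherwise left unchanged. The C sketch below models that rule for illustration only; the function name `c_element` and its array interface are assumptions of this sketch and do not appear in the AMPIRE sources.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the acknowledge-merging rule used in regdwbq.
 * Returns the new output given n inputs and the previous output:
 * high when every input is high, low when every input is low, and
 * unchanged (prev) while the inputs disagree. */
bool c_element(const bool *inputs, int n, bool prev)
{
    bool all_high = true;
    bool all_low = true;
    int i;

    assert(n > 0);
    for (i = 0; i < n; i++) {
        if (inputs[i])
            all_low = false;
        else
            all_high = false;
    }
    if (all_high)
        return true;
    if (all_low)
        return false;
    return prev;                /* inputs disagree: hold the last value */
}
```

For `halt_ack` and `roll_ack`, the queue's own `haltr_ack` or `rollr_ack` would simply be a fifth element of the input array.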

regfile.v

// Register File (3-port: 1 write, 2 read)

module regfile (wr_reg, wr_data, wr_req, wr_ack,
                rd_reg1, rd_reg2, rd_data1, rd_data2, rd_req, rd_ack,
                halt_req, halt_ack, roll_req, roll_ack, reset);

`include "parameter"

input [REG_WIDTH-1:0] wr_reg, rd_reg1, rd_reg2;
input [DATA_WIDTH:0] wr_data;
output [DATA_WIDTH:0] rd_data1, rd_data2;
input wr_req, rd_req, halt_req, roll_req, reset;
output wr_ack, rd_ack, halt_ack, roll_ack;

reg [DATA_WIDTH:0] rd_data1, rd_data2, rfile [0:REG_SIZE-1];
reg wr_ack, rd_ack, halt_ack, roll_ack;
reg [REG_WIDTH:0] loop;            // one extra bit for loop termination

always wait (reset)
  begin
    disable rollback_cycle;
    disable write_cycle;
    disable read_cycle;
    rd_data1 = XXX;
    rd_data2 = XXX;
    wr_ack = 0;
    rd_ack = 0;
    halt_ack = 0;
    roll_ack = 0;
    for (loop = 0; loop < REG_SIZE; loop = loop + 1)
      rfile [loop] = XXX;
    wait (~reset);
  end

always wait (halt_req & ~reset)
  begin :rollback_cycle
    #1;
    halt_ack = 1;
    wait (roll_req);
    #1;
    disable write_cycle;
    disable read_cycle;
    rd_data1 = XXX;
    rd_data2 = XXX;
    wr_ack = 0;
    rd_ack = 0;

    roll_ack = 1;
    fork
      begin
        wait (~roll_req);
        #1;
        roll_ack = 0;
      end
      begin
        wait (~halt_req);
        #1;
        halt_ack = 0;
      end
    join
  end

always wait (wr_req & ~halt_req & ~reset)
  begin :write_cycle
    #DLY_RF_WR;
    wait (~halt_req);
    rfile [wr_reg] = wr_data;
    wr_ack = 1;
    wait (~wr_req);
    #1;
    wait (~halt_req);
    wr_ack = 0;
  end

always wait (rd_req & ~halt_req & ~reset)
  begin :read_cycle
    #DLY_RF_RD;
    rd_data1 = rd_reg1 ? rfile [rd_reg1] : 0;
    rd_data2 = rd_reg2 ? rfile [rd_reg2] : 0;
    #1;
    wait (~halt_req);
    rd_ack = 1;
    wait (~rd_req);
    #1;
    wait (~halt_req);
    rd_ack = 0;
  end

endmodule // regfile

restable.v

// Reservation Table (for registers)

module restable (seq_num, reg_w, reg_r1, reg_r2, res_req, res_ack,
                 reg_clr, clr_req, clr_ack,
                 val_seq, halt_req, halt_ack, roll_req, roll_ack, reset);

`include "parameter"

input [SEQ_WIDTH-1:0] seq_num, val_seq;
input [REG_WIDTH-1:0] reg_w, reg_r1, reg_r2, reg_clr;
input res_req, clr_req, halt_req, roll_req, reset;
output res_ack, clr_ack, halt_ack, roll_ack;

reg [SEQ_WIDTH-1:0] seq_table [0:REG_SIZE-1];
reg res_table [0:REG_SIZE-1];
reg res_ack, clr_ack, halt_ack, roll_ack;

reg [SEQ_WIDTH-1:0] diff;
reg [REG_WIDTH:0] loop;            // one extra bit for loop termination

always wait (reset)
  begin
    disable rollback_cycle;
    disable reserve_cycle;
    disable clear_cycle;
    res_ack = 0;
    clr_ack = 0;
    halt_ack = 0;
    roll_ack = 0;
    for (loop = 0; loop < REG_SIZE; loop = loop + 1)
      res_table [loop] = 0;
    wait (~reset);
  end

always wait (halt_req & ~reset)
  begin :rollback_cycle
    #1;
    halt_ack = 1;
    wait (roll_req);
    #1;
    disable reserve_cycle;
    disable clear_cycle;
    res_ack = 0;
    clr_ack = 0;

    #DLY_SEQ_COMP;
    for (loop = 1; loop < REG_SIZE; loop = loop + 1)
      begin
        diff = seq_table [loop] - val_seq;
        if (~diff[SEQ_WIDTH-1])
          res_table [loop] = 0;
      end

    roll_ack = 1;
    fork
      begin
        wait (~roll_req);
        #1;
        roll_ack = 0;
      end
      begin
        wait (~halt_req);
        #1;
        halt_ack = 0;
      end
    join
  end

always wait (res_req & ~halt_req & ~reset)
  begin :reserve_cycle
    #DLY_RT_RES;
    wait (~res_table [reg_w] & ~res_table [reg_r1] &
          ~res_table [reg_r2]);
    #1;
    wait (~halt_req);
    res_table [reg_w] = 1 & (reg_w != 0);       // R0 always available
    seq_table [reg_w] = seq_num;
    res_ack = 1;
    wait (~res_req);
    #1;
    wait (~halt_req);
    res_ack = 0;
  end

always wait (clr_req & ~halt_req & ~reset)
  begin :clear_cycle
    #DLY_RT_CLR;
    wait (~halt_req);
    res_table [reg_clr] = 0;
    clr_ack = 1;
    wait (~clr_req);
    #1;
    wait (~halt_req);
    clr_ack = 0;
  end

endmodule // restable
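During rollback, both regdwb_buf and restable decide which entries to discard by computing `diff = seq - val_seq` in `SEQ_WIDTH`-bit arithmetic and testing its most significant bit. A clear MSB means the entry's sequence number is equal to or later than `val_seq`, so the entry is discarded. The C sketch below reproduces that test with the width passed as a parameter; the name `seq_at_or_after` is hypothetical. It assumes, as any such modular comparison must, that fewer than 2^(SEQ_WIDTH-1) sequence numbers separate the two values being compared, which is what keeps the result correct across wraparound.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper mirroring the Verilog rollback test
 *     diff = out_seq - val_seq;  if (~diff [SEQ_WIDTH-1]) ...
 * Returns true when seq is at or after ref in a circular space of
 * 2^width sequence numbers. */
bool seq_at_or_after(unsigned seq, unsigned ref, unsigned width)
{
    unsigned mask, diff;

    assert(width >= 1 && width <= 31);
    mask = (1u << width) - 1u;                  /* SEQ_WIDTH-bit wraparound */
    diff = (seq - ref) & mask;                  /* modular subtraction */
    return ((diff >> (width - 1)) & 1u) == 0;   /* MSB clear: at or after */
}
```

With a 4-bit sequence space, `seq_at_or_after(1, 14, 4)` is true because 1 follows 14 after wraparound, while `seq_at_or_after(3, 5, 4)` is false.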

Appendix B: AMPIRE Assembler

B.1. Assembly Code Format

The assembly code has the following format. Each part is optional, and the whole line is case-insensitive.

[label:] [*][instruction] [;comment]

The instruction set is shown in Table 3.1. Each register operand is written as R0 through R31. The immediate/offset field may be a decimal number, a hexadecimal number (with an 'H' postfix), or a label. A number may also carry the '#' prefix used in [Henn90], but the assembler simply ignores that character. Adding the '*' prefix to any instruction makes the assembler emit that word with a bad parity bit, for use in fault simulation.
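The numeric operand rules above can be illustrated with a small C helper that accepts an optional '#' prefix, a decimal value, or a hexadecimal value with an 'H' postfix, then range-checks the result and encodes it as a maxbits-wide two's complement field, which is what get_num in asm.c does for immediate, offset, and data fields. This is a hypothetical sketch, not part of the assembler: `parse_operand` is an invented name, it does not resolve labels, and it uses `strtoll` and rejects trailing characters that the original `sscanf` calls would silently ignore.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not in asm.c): parse a numeric operand written
 * as described in B.1 and encode it as a maxbits-wide two's complement
 * field, mirroring the range check and encoding done by get_num.
 * Labels are not handled. Returns false on a malformed or
 * out-of-range value. */
bool parse_operand(const char *tok, int maxbits, unsigned long *field)
{
    char buf[32];
    char *end;
    size_t len;
    long long value;
    int base = 10;

    assert(maxbits >= 1 && maxbits <= 32);
    if (*tok == '#')                        /* '#' prefix is ignored */
        tok++;
    len = strlen(tok);
    if (len == 0 || len >= sizeof buf)
        return false;
    memcpy(buf, tok, len + 1);
    if (tolower((unsigned char) buf[len - 1]) == 'h') {
        buf[len - 1] = '\0';                /* 'H' postfix selects hex */
        base = 16;
    }
    value = strtoll(buf, &end, base);
    if (end == buf || *end != '\0')         /* no digits or trailing junk */
        return false;
    if (value < -(1LL << (maxbits - 1)) || value > (1LL << (maxbits - 1)) - 1)
        return false;                       /* outside the signed range */
    if (value < 0)
        value += 1LL << maxbits;            /* two's complement form */
    *field = (unsigned long) value;
    return true;
}
```

As in get_num, the range check is signed, so a 16-bit field accepts -1 (encoded as 0xFFFF) but rejects FFFFH.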

B.2. Assembler Source Code

The C source code of the assembler, asm.c, is listed below.

asm.c

/* AMPIRE Assembler */

/* INST_WIDTH must be less than or equal to 32 bits. Even parity is used. */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define OP_WIDTH     6      /* bits for opcode field */
#define REG_WIDTH    5      /* bits for register number */
#define EXTRA_WIDTH  11     /* bits for "extra" field */
#define ADDR_WIDTH   8      /* bits for address (for comment only) */
#define ADDR_INC     4      /* amount of address increment */

#define IMM_WIDTH    (REG_WIDTH + EXTRA_WIDTH)
#define OFFSET_WIDTH (2 * REG_WIDTH + IMM_WIDTH)
#define INST_WIDTH   (OP_WIDTH + OFFSET_WIDTH)

#define MAX_OPS      50     /* max number of opcodes */
#define MAX_LINE     100    /* max number of characters per line */
#define MAX_LABELS   50     /* max number of labels */
#define MAX_LEN      20     /* max length of labels */
#define SPACE        ' '
#define EOL          '\n'

#define F_NONE       0      /* instruction format codes */
#define F_ALU        1
#define F_ALUI       2
#define F_LHI        3
#define F_JREG       4
#define F_BRANCH     5
#define F_OFFSET     6
#define F_LOAD       7
#define F_STORE      8
#define F_DATA       9

#define T_ABS        0      /* absolute address type */
#define T_REL        1      /* PC relative type */

char *opname[MAX_OPS];              /* opcode name database */
int opnum[MAX_OPS];                 /* opcode number database */
int extra[MAX_OPS];                 /* extra field database */
char format[MAX_OPS];               /* instruction format codes */
int opcount;                        /* number of opcodes */

char label[MAX_LABELS][MAX_LEN];    /* label name database */
int addr[MAX_LABELS];               /* label address database */
int label_count = 0;                /* number of labels */

int cur_addr;                       /* current address */
int cur_line;                       /* current line in input file */
char parity;                        /* used to create good/bad parity */
FILE *infile, *outfile;

/* ============================================================== */

main (argc, argv)
int argc;
char *argv[];
{
    extern char format[MAX_OPS];
    extern int label_count, cur_addr, cur_line;
    extern FILE *infile, *outfile;

    char org_line[MAX_LINE], work_line[MAX_LINE];
    int pointer, opindex, rd, rs, rt, imm, offset, data;

    if (argc != 2) {
        printf ("Usage: asm source_file (output: inst.hex)\n");
        exit (1);
    }

    define_opcodes ();                      /* define opcode database */

    printf ("first pass ...\n");
    infile = fopen (argv[1], "r");          /* open input file */
    if (infile == NULL) {
        printf ("*** source file error ***\n");
        exit (1);
    }

    cur_line = 0;
    cur_addr = 0;
    while (get_line(org_line) != EOF) {     /* 1st pass -- scan for labels */
        cur_line++;
        strcpy (work_line, org_line);
        filter_line (work_line);
        pointer = 0;
        opindex = get_opindex (work_line, &pointer);

        if (opindex == -2) {
            add_label (work_line);          /* add to label database */
            opindex = get_opindex (work_line, &pointer);
        }

        if (opindex >= 0)
            cur_addr += ADDR_INC;           /* real instruction */
    } /* while */
    fclose (infile);

    printf ("%d out of %d label database slots are used.\n", label_count,
            MAX_LABELS);
    printf ("second pass ...\n");

    infile = fopen (argv[1], "r");          /* open input file */
    if (infile == NULL) {
        printf ("*** source file error ***\n");
        exit (1);
    }

    outfile = fopen ("inst.hex", "w");      /* open output file */
    if (outfile == NULL) {
        printf ("*** output file error ***\n");
        exit (1);
    }

    cur_line = 0;
    cur_addr = 0;
    while (get_line(org_line) != EOF) {     /* 2nd pass -- assemble codes */
        cur_line++;
        strcpy (work_line, org_line);
        filter_line (work_line);
        pointer = 0;
        opindex = get_opindex (work_line, &pointer);

        if (opindex == -2)                  /* skip labels */
            opindex = get_opindex (work_line, &pointer);

        if (opindex >= 0) {                 /* different instruction formats */
            rs = 0;
            rt = 0;
            rd = 0;
            imm = 0;
            offset = 0;

            switch (format[opindex]) {
            case F_NONE: {
                break;
            }
            case F_ALU: {
                rd = get_reg (work_line, &pointer);
                rs = get_reg (work_line, &pointer);
                rt = get_reg (work_line, &pointer);
                break;
            }
            case F_ALUI: {
                rt = get_reg (work_line, &pointer);
                rs = get_reg (work_line, &pointer);
                imm = get_num (work_line, &pointer, IMM_WIDTH,
                               T_ABS);
                break;
            }
            case F_LHI: {
                rt = get_reg (work_line, &pointer);
                imm = get_num (work_line, &pointer, IMM_WIDTH,
                               T_ABS);
                break;
            }
            case F_JREG: {
                rs = get_reg (work_line, &pointer);
                break;
            }
            case F_BRANCH: {
                rs = get_reg (work_line, &pointer);
                imm = get_num (work_line, &pointer, IMM_WIDTH,
                               T_REL);
                break;
            }
            case F_OFFSET: {
                offset = get_num (work_line, &pointer, OFFSET_WIDTH,
                                  T_REL);
                break;
            }
            case F_LOAD: {
                rt = get_reg (work_line, &pointer);
                imm = get_num (work_line, &pointer, IMM_WIDTH,
                               T_ABS);
                rs = get_reg (work_line, &pointer);
                break;
            }
            case F_STORE: {
                imm = get_num (work_line, &pointer, IMM_WIDTH,
                               T_ABS);
                rs = get_reg (work_line, &pointer);
                rt = get_reg (work_line, &pointer);
                break;
            }
            case F_DATA: {
                data = get_num (work_line, &pointer, INST_WIDTH,
                                T_ABS);
                break;
            }
            } /* switch */

            check_line (work_line, pointer);
        } /* if */

        print_line (org_line, opindex, rd, rs, rt, imm, offset, data);
    } /* while */

    fclose (outfile);
    fclose (infile);
    printf ("\n%d Words, or %d Bytes\n", cur_addr/ADDR_INC, cur_addr);
} /* main */

/* ============================================================== */

get_line (line)         /* get one line from file, return length or EOF */
char line[];
{
    extern int cur_line;
    extern FILE *infile;

    int index, letter;

    for (index=0; (letter=getc(infile)) != EOF && letter != EOL; index++)
        line[index] = letter;
    line[index] = '\0';

    if (index >= MAX_LINE) {
        printf ("(%d) Line Too Long:\n%s\n", cur_line, line);
        exit (1);
    }

    if (letter == EOF && index == 0)
        return (EOF);
    else
        return (index);
} /* function get_line */

/* ============================================================== */

/* Convert unneeded characters to spaces, uppercase to lowercase. */

filter_line (line)
char line[];
{
    int index = 0;
    char letter;

    while ((letter=line[index]) != '\0') {
        if (letter=='\t' || letter==',' || letter=='(' || letter==')' ||
            letter=='#')
            line[index] = SPACE;
        if (letter >= 'A' && letter <= 'Z')
            line[index] = letter - 'A' + 'a';
        index++;
    }
} /* function filter_line */

/* ============================================================== */

/* If valid, return opcode index with pointer after the opcode.
 * If comment or blank line, return -1. If label, return -2.
 */

get_opindex (line, index)
char line[];
int *index;
{
    extern char *opname[MAX_OPS];
    extern int opcount, cur_line;
    extern char parity;

    int wpos = 0, opindex;
    char word[MAX_LINE];

    while (line[*index] == SPACE)                       /* find first non-space */
        (*index)++;

    if (line[*index] == ';' || line[*index] == '\0')    /* comment/blank */
        return (-1);

    parity = 0;
    if (line[*index] == '*') {                          /* create bad parity */
        parity = 1;
        (*index)++;
    }

    while (line[*index] != SPACE && line[*index] != '\0')   /* get 1st word */
        word[wpos++] = line[(*index)++];
    word[wpos] = '\0';

    if (word[wpos-1] == ':')                            /* label */
        return (-2);

    for (opindex=0; opindex<opcount; opindex++)     /* find matching opcode */
        if (strcmp(word, opname[opindex]) == 0)
            break;

    if (opindex == opcount) {                       /* no match */
        printf ("(%d) Invalid Opcode:\n%s\n", cur_line, line);
        exit (1);
    }

    return (opindex);
} /* function get_opindex */

/* ============================================================== */

add_label (line)        /* add label to database */
char line[];
{
    extern char label[MAX_LABELS][MAX_LEN];
    extern int addr[MAX_LABELS];
    extern int label_count, cur_line;

    int wpos = 0, index, loop;
    char word[MAX_LINE];

    if (label_count >= MAX_LABELS) {
        printf ("Label Database Full, %d Entries\n", label_count);
        exit (1);
    }

    index = 0;
    while (line[index] == SPACE)                    /* find first non-space */
        index++;

    while (line[index] != ':')                      /* get label */
        word[wpos++] = line[index++];
    word[wpos] = '\0';

    if (wpos >= MAX_LEN) {
        printf ("(%d) Label Too Long:\n%s\n", cur_line, line);
        exit (1);
    }

    for (loop=0; loop<label_count; loop++)          /* see if already defined */
        if (strcmp(word, label[loop]) == 0) {
            printf ("(%d) Label Already Defined:\n%s\n", cur_line, line);
            exit (1);
        }

    strcpy (label[label_count], word);              /* add it */
    addr[label_count++] = cur_addr;
} /* function add_label */

/* ============================================================== */

get_reg (line, index)   /* return register number */
char line[];
int *index;
{
    extern int cur_line;

    int wpos = 0, reg_num;
    char word[MAX_LINE];

    while (line[*index] == SPACE)                   /* skip spaces */
        (*index)++;

    while (line[*index] != SPACE && line[*index] != '\0')   /* get one word */
        word[wpos++] = line[(*index)++];
    word[wpos] = '\0';

    if (word[0] != 'r') {
        printf ("(%d) Register must start with 'r':\n%s\n", cur_line, line);
        exit (1);
    }

    if (sscanf (&word[1], "%d", &reg_num) < 1) {
        printf ("(%d) Invalid Register Format:\n%s\n", cur_line, line);
        exit (1);
    }

    if (reg_num < 0 || reg_num > (power2(REG_WIDTH)-1)) {
        printf ("(%d) Register Number Out of Range:\n%s\n", cur_line, line);
        exit (1);
    }

    return (reg_num);
} /* function get_reg */

/* ============================================================== */

get_num (line, index, maxbits, type)    /* return number */
char line[];
int *index, maxbits, type;
{
    extern int cur_line;

    int wpos = 0, num;
    char word[MAX_LINE];

    while (line[*index] == SPACE)                   /* skip spaces */
        (*index)++;

    while (line[*index] != SPACE && line[*index] != '\0')   /* get one word */
        word[wpos++] = line[(*index)++];
    word[wpos] = '\0';

    num = find_label (word);            /* find address, -1 if not found */
    if (num >= 0) {
        if (type == T_REL)
            num = num - (cur_addr + ADDR_INC);      /* PC relative */
    }
    else
        if (line[(*index)-1] == 'h') {
            if (sscanf (word, "%x", &num) < 1) {
                printf ("(%d) Invalid Hexadecimal Format:\n%s\n",
                        cur_line, line);
                exit (1);
            }
        }
        else
            if (sscanf (word, "%d", &num) < 1) {
                printf ("(%d) Invalid Decimal Format:\n%s\n",
                        cur_line, line);
                exit (1);
            }

    if (num < -(power2(maxbits-1)) || num > (power2(maxbits-1)-1)) {
        printf ("(%d) %d-bit 2's Complement Number Out of Range:\n%s\n",
                cur_line, maxbits, line);
        exit (1);
    }

    if (num < 0)                                    /* 2's complement form */
        num = power2 (maxbits) + num;
    return (num);
} /* function get_num */

/* ============================================================== */

find_label (word)       /* return address, or -1 if not found */
char word[];
{
    extern char label[MAX_LABELS][MAX_LEN];
    extern int addr[MAX_LABELS];
    extern int label_count;

    int loop;

    for (loop=0; loop<label_count; loop++)          /* search */
        if (strcmp(word, label[loop]) == 0)
            return (addr[loop]);

    return (-1);                                    /* not found */
} /* function find_label */

/* ============================================================== */

check_line (line, index)    /* check the rest of the line */
char line[];
int index;
{
    extern int cur_line;

    while (line[index] == SPACE)                    /* skip spaces */
        index++;

    if (line[index] != ';' && line[index] != '\0') {
        printf ("(%d) Too Many Operands/Comments without Semicolon:\n%s\n",
                cur_line, line);
        exit (1);
    }
} /* function check_line */

/* ============================================================== */

/* To generate correct even parity, 'parity' should be 0 when this routine
 * is called. The bad parity flag is set in the get_opindex routine.
 */

print_line (org_line, opindex, rd, rs, rt, imm, offset, data)
char org_line[];
int opindex, rd, rs, rt, imm, offset, data;
{
    extern int opnum[MAX_OPS];
    extern int extra[MAX_OPS];
    extern char format[MAX_OPS];
    extern int cur_addr;
    extern char parity;
    extern FILE *outfile;

    unsigned int inst;
    int loop, inst_digits, addr_digits;

    inst_digits = (INST_WIDTH + 1) / 4;             /* number of hex digits */
    if ((INST_WIDTH + 1) % 4 != 0)
        inst_digits++;
    addr_digits = ADDR_WIDTH / 4;
    if (ADDR_WIDTH % 4 != 0)
        addr_digits++;

    if (opindex >= 0) {
        if (format[opindex] == F_DATA)
            inst = data;
        else        /* unsigned multiply: opcodes >= 32 would overflow int */
            inst = (unsigned) opnum[opindex] * power2 (OFFSET_WIDTH) +
                   rs * power2 (REG_WIDTH + IMM_WIDTH) +
                   rt * power2 (IMM_WIDTH) +
                   rd * power2 (EXTRA_WIDTH) +
                   extra[opindex] + imm + offset;

        for (loop=0; loop<INST_WIDTH; loop++)       /* find parity bit */
            if (inst & (1u << loop))
                parity ^= 1;                        /* XOR */

        if (INST_WIDTH < 32) {
            inst += parity * power2 (INST_WIDTH);
            fprintf (outfile, "%.*X", inst_digits, inst);
        }
        else {
            fprintf (outfile, "%d", parity);
            fprintf (outfile, "%.*X", inst_digits-1, inst);
        }

        fprintf (outfile, " // ");
        fprintf (outfile, "%.*X", addr_digits, cur_addr);
        cur_addr += ADDR_INC;
    }
    else {                                          /* no instruction */
        for (loop=0; loop<inst_digits; loop++)
            putc (SPACE, outfile);
        fprintf (outfile, " // ");
        for (loop=0; loop<addr_digits; loop++)
            putc (SPACE, outfile);
    }

    fprintf (outfile, "%s\n", org_line);
} /* function print_line */

/* ============================================================== */

power2 (exp)            /* return power of 2 */
int exp;
{
    int result = 1, loop;

    for (loop=0; loop<exp; loop++)
        result = result * 2;

    return (result);
} /* function power2 */

/* ============================================================== */

/* instruction format code:
 *   NONE     opcode
 *   ALU      opcode rd, rs, rt
 *   ALUI     opcode rt, rs, imm
 *   LHI      opcode rt, imm
 *   JREG     opcode rs
 *   BRANCH   opcode rs, imm
 *   OFFSET   opcode offset
 *   LOAD     opcode rt, imm(rs)
 *   STORE    opcode imm(rs), rt
 *   DATA     opcode data_word
 */

define_opcodes ()
{
    extern char *opname[MAX_OPS];
    extern int opnum[MAX_OPS];
    extern int extra[MAX_OPS];
    extern char format[MAX_OPS];
    extern int opcount;

    int x = 0;

    opname[x]="j"; opnum[x]=2; extra[x]=0; format[x++]=F_OFFSET;
    opname[x]="jal"; opnum[x]=3; extra[x]=0; format[x++]=F_OFFSET;
    opname[x]="beqz"; opnum[x]=4; extra[x]=0; format[x++]=F_BRANCH;
    opname[x]="bnez"; opnum[x]=5; extra[x]=0; format[x++]=F_BRANCH;
    opname[x]="addui"; opnum[x]=9; extra[x]=0; format[x++]=F_ALUI;
    opname[x]="subui"; opnum[x]=11; extra[x]=0; format[x++]=F_ALUI;
    opname[x]="andi"; opnum[x]=12; extra[x]=0; format[x++]=F_ALUI;
    opname[x]="ori"; opnum[x]=13; extra[x]=0; format[x++]=F_ALUI;
    opname[x]="xori"; opnum[x]=14; extra[x]=0; format[x++]=F_ALUI;
    opname[x]="lhi"; opnum[x]=15; extra[x]=0; format[x++]=F_LHI;
    opname[x]="trap"; opnum[x]=17; extra[x]=0; format[x++]=F_OFFSET;
    opname[x]="jr"; opnum[x]=18; extra[x]=0; format[x++]=F_JREG;
    opname[x]="jalr"; opnum[x]=19; extra[x]=0; format[x++]=F_JREG;
    opname[x]="slli"; opnum[x]=20; extra[x]=0; format[x++]=F_ALUI;
    opname[x]="srli"; opnum[x]=22; extra[x]=0; format[x++]=F_ALUI;
    opname[x]="srai"; opnum[x]=23; extra[x]=0; format[x++]=F_ALUI;
    opname[x]="seqi"; opnum[x]=24; extra[x]=0; format[x++]=F_ALUI;
    opname[x]="snei"; opnum[x]=25; extra[x]=0; format[x++]=F_ALUI;
    opname[x]="slti"; opnum[x]=26; extra[x]=0; format[x++]=F_ALUI;
    opname[x]="sgti"; opnum[x]=27; extra[x]=0; format[x++]=F_ALUI;
    opname[x]="slei"; opnum[x]=28; extra[x]=0; format[x++]=F_ALUI;
    opname[x]="sgei"; opnum[x]=29; extra[x]=0; format[x++]=F_ALUI;
    opname[x]="lw"; opnum[x]=35; extra[x]=0; format[x++]=F_LOAD;
    opname[x]="sw"; opnum[x]=43; extra[x]=0; format[x++]=F_STORE;
    opname[x]="adduif"; opnum[x]=49; extra[x]=0; format[x++]=F_ALUI;
    opname[x]="jrf"; opnum[x]=50; extra[x]=0; format[x++]=F_JREG;
    opname[x]="swf"; opnum[x]=51; extra[x]=0; format[x++]=F_STORE;

    opname[x]="nop"; opnum[x]=0; extra[x]=0; format[x++]=F_NONE;
    opname[x]="sll"; opnum[x]=0; extra[x]=4; format[x++]=F_ALU;
    opname[x]="srl"; opnum[x]=0; extra[x]=6; format[x++]=F_ALU;
    opname[x]="sra"; opnum[x]=0; extra[x]=7; format[x++]=F_ALU;
    opname[x]="adduf"; opnum[x]=0; extra[x]=25; format[x++]=F_ALU;
    opname[x]="addu"; opnum[x]=0; extra[x]=33; format[x++]=F_ALU;
    opname[x]="subu"; opnum[x]=0; extra[x]=35; format[x++]=F_ALU;
    opname[x]="and"; opnum[x]=0; extra[x]=36; format[x++]=F_ALU;
    opname[x]="or"; opnum[x]=0; extra[x]=37; format[x++]=F_ALU;
    opname[x]="xor"; opnum[x]=0; extra[x]=38; format[x++]=F_ALU;
    opname[x]="seq"; opnum[x]=0; extra[x]=40; format[x++]=F_ALU;
    opname[x]="sne"; opnum[x]=0; extra[x]=41; format[x++]=F_ALU;
    opname[x]="slt"; opnum[x]=0; extra[x]=42; format[x++]=F_ALU;
    opname[x]="sgt"; opnum[x]=0; extra[x]=43; format[x++]=F_ALU;
    opname[x]="sle"; opnum[x]=0; extra[x]=44; format[x++]=F_ALU;
    opname[x]="sge"; opnum[x]=0; extra[x]=45; format[x++]=F_ALU;

    opname[x]=".word"; opnum[x]=0; extra[x]=0; format[x++]=F_DATA;

    opcount = x;
    printf ("%d out of %d opcode database slots are used.\n", x, MAX_OPS);
} /* function define_opcodes */
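print_line builds each instruction word by scaling the fields with powers of two and adding them, then computes one extra top bit so that the full word has even parity; when the '*' prefix is present, get_opindex starts the count at 1 instead of 0, which yields a bad parity bit. The C sketch below shows the same ALU-format packing with shifts and the same parity rule. The names `pack_alu` and `parity_bit` are hypothetical, and the field widths are the asm.c defaults (6-bit opcode, 5-bit register fields, 11-bit extra field).

```c
#include <assert.h>
#include <stdint.h>

/* Field widths copied from the asm.c defaults. */
enum { OP_W = 6, REG_W = 5, EXTRA_W = 11, IMM_W = REG_W + EXTRA_W };

/* Hypothetical helper: pack an ALU-format instruction as print_line
 * does, with opcode | rs | rt | rd | extra from bit 31 down to bit 0. */
uint32_t pack_alu(unsigned op, unsigned rs, unsigned rt, unsigned rd,
                  unsigned extra)
{
    assert(op < 64 && rs < 32 && rt < 32 && rd < 32 && extra < 2048);
    return ((uint32_t) op << (2 * REG_W + IMM_W)) |
           ((uint32_t) rs << (REG_W + IMM_W)) |
           ((uint32_t) rt << IMM_W) |
           ((uint32_t) rd << EXTRA_W) |
           (uint32_t) extra;
}

/* Hypothetical helper: parity bit stored above bit 31. With bad == 0
 * the 33-bit word has even parity; bad != 0 (the '*' prefix) inverts it. */
unsigned parity_bit(uint32_t inst, int bad)
{
    unsigned p = bad ? 1u : 0u;
    int i;

    for (i = 0; i < 32; i++)
        p ^= (inst >> i) & 1u;              /* XOR in every bit */
    return p;
}
```

For example, `addu r3, r1, r2` (opcode 0, extra field 33 in define_opcodes) packs to 0x221821, which has six 1 bits, so its good parity bit is 0 and the same instruction written with the '*' prefix gets parity bit 1.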

Bibliography

[Bell78] C. G. Bell, A. Kotok, T. N. Hastings, and R. Hill, “The Evolution of the DECsystem-10,” in Computer Engineering, a DEC View of Hardware Systems Design (Bell, Mudge, and McNamara), Digital Press, Bedford, MA (1978).

[Berk91] C. H. v. Berkel, “Beware the Isochronic Fork,” Nat. Lab. Unclassified Report UR 003/91, Philips Research Lab., Eindhoven, The Netherlands (1991).

[Burn87] S. M. Burns and A. J. Martin, “Syntax-Directed Translation of Concurrent Programs into Self-Timed Circuits,” Advanced Research in VLSI: Proceedings of the Fifth MIT Conference, Cambridge, MA, pp. 35-50 (March 1987).

[Cade91] Cadence, Verilog-XL Reference Manual, Cadence Design Systems, Inc., Lowell, MA (1991).

[Cast82] X. Castillo, S. R. McConnel, and D. P. Siewiorek, “Derivation and Calibration of a Transient Error Reliability Model,” IEEE Transactions on Computers C-31(7), pp. 658-671 (July 1982).

[Ciac81] M. L. Ciacelli, “Fault Handling on the IBM 4341 Processor,” 11th Fault Tolerant Computing Symposium, Portland, Maine, pp. 9-12 (June 1981).

[Dall86] W. J. Dally and C. L. Seitz, “The Torus Routing Chip,” Distributed Computing 1(4), pp. 187-196 (October 1986).

[Fran83] E. H. Frank and R. F. Sproull, “A Self-Timed Static RAM,” Third Caltech Conference on VLSI, Pasadena, CA, pp. 275-285 (March 1983).

[Henn90] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, San Mateo, CA (1990).

[Host91] L. B. Hostetler and B. Mirtich, “DLXsim - A Simulator for DLX,” Documentation in the DLX Simulator Software Package (May 1, 1991).

[Jaco90] G. M. Jacobs and R. W. Brodersen, “A Fully Asynchronous Digital Signal Processor Using Self-Timed Circuits,” IEEE Journal of Solid-State Circuits 25(6), pp. 1526-1537 (December 1990).

[Mart85] A. J. Martin, ‘‘The Design of a Self-Timed Circuit for DistributedMutual Exclusion,’’ 1985 Chapel Hill Conference on VLSI, Chapel Hill,NC, pp. 245-260 (March 1985).

[Mart86] A. J. Martin, ‘‘Compiling Communicating Processes into Delay-Insensitive VLSI Circuits,’’ Distributed Computing 1(4), pp. 226-234 (October 1986).

[Mart89] A. J. Martin, S. M. Burns, T. K. Lee, D. Borkovic, and P. J. Hazewindus, ‘‘The First Asynchronous Microprocessor: The Test Results,’’ Computer Architecture News 17(4), pp. 95-110 (June 1989).

[Meng89] T. H.-Y. Meng, R. W. Brodersen, and D. G. Messerschmitt, ‘‘Automatic Synthesis of Asynchronous Circuits from High-Level Specifications,’’ IEEE Transactions on Computer-Aided Design 8(11), pp. 1185-1205 (November 1989).

[Moln85] C. E. Molnar, T.-P. Fang, and F. U. Rosenberger, ‘‘Synthesis of Delay-Insensitive Modules,’’ 1985 Chapel Hill Conference on VLSI, Chapel Hill, NC, pp. 67-86 (March 1985).

[Patt83] D. A. Patterson, P. Garrison, M. Hill, D. Lioupis, C. Nyberg, T. Sippel, and K. V. Dyke, ‘‘Architecture of a VLSI Instruction Cache For a RISC,’’ 10th Annual Symposium on Computer Architecture, Stockholm, Sweden, pp. 108-116 (June 1983).

[Seit80] C. L. Seitz, ‘‘System Timing,’’ in Introduction to VLSI Systems (Mead and Conway), Addison-Wesley, Reading, MA (1980).

[Siew92] D. P. Siewiorek, D. Ciplickas, J. Willis, A. Gupta, and J. Quinlan, ‘‘Laboratory Experiences with Verilog Simulation in an Undergraduate Computer Architecture Course,’’ Proceedings of the Annual Open Verilog International User Group Meeting, Santa Clara, CA (March 24-25, 1992).

[Stro85] R. E. Strom and S. Yemini, ‘‘Optimistic Recovery in Distributed Systems,’’ ACM Transactions on Computer Systems 3(3), pp. 204-226 (August 1985).

[Suth89] I. E. Sutherland, ‘‘Micropipelines,’’ Communications of the ACM 32(6), pp. 720-738 (June 1989).

[Tami88] Y. Tamir, M. Tremblay, and D. A. Rennels, ‘‘The Implementation and Application of Micro Rollback in Fault-Tolerant VLSI Systems,’’ 18th Fault-Tolerant Computing Symposium, Tokyo, Japan, pp. 234-239 (June 1988).

[Tami90a] Y. Tamir and M. Tremblay, ‘‘High-Performance Fault-Tolerant VLSI Systems Using Micro Rollback,’’ IEEE Transactions on Computers C-39(4), pp. 548-554 (April 1990).

[Tami90b] Y. Tamir, M. Liang, T. Lai, and M. Tremblay, ‘‘The UCLA Mirror Processor: A Building Block for Self-Checking Self-Repairing Computing Nodes,’’ CS Department Technical Report #CSD-900040, University of California, Los Angeles, CA (November 1990).

[Trem89] M. Tremblay and Y. Tamir, ‘‘Support for Fault Tolerance in VLSI Processors,’’ International Symposium on Circuits and Systems, Portland, OR, pp. 388-393 (May 1989).

[TRW92] TRW, ‘‘RH32 Spaceborne Data Processor,’’ Preliminary Product Announcement, TRW Space Communications Division, Redondo Beach, CA (March 1992).

[Wils92] R. Wilson, ‘‘VLSI Meet Sees CPUs Speed Up,’’ Electronic Engineering Times, p. 1 (June 8, 1992).

[Wint92] K. D. Winters, ‘‘ASIC Design Experience in Undergraduate Logic Design and Computer Architecture Courses,’’ Engineering Research Laboratory Report #92005, Montana State University, Bozeman, MT (April 10, 1992).
