ECE813 Design Project 4
16-Bit RISC CPU
By: Kelly Davidson Guangda Shi
Xin Zhao Junwei Zhou
Submitted to Dr. A. Mason April 22, 2002
I. INTRODUCTION
This report summarizes the results of our work in building a 16-bit RISC CPU. Our
original goal was to create an 8-bit ALU, but as the design progressed, it was decided to produce
a 16-bit ALU, which is more useful in computation. This larger design included the datapath for
a CPU containing Program Counter (PC), Memory Address Register (MAR), Memory Data
Register (MDR), Instruction Register (IR), Register File (RF) and other various components to
tie them all together. The controller of the CPU was implemented using Verilog.
The system described above was simulated in two ways. Analog simulation was used to
verify functionality on a smaller scale, and to determine the delay times of various portions of
the circuit. After determining the system characteristics such as timing information, the entire
system was verified using Verilog simulation.
The base design of this system was done using static CMOS logic. It includes layout of
basic standard cells as well as our 16-bit ALU. Our 16-bit ALU was composed of sixteen 1-bit
ALUs connected in series. The ALU performed 2’s complement arithmetic in one execution step by
making use of a Controlled Adder/Subtractor (CAS) unit described later. Alternative designs for
this project focused on the ALU portion of the system. The goal of these alternative designs was
to improve speed and power consumption as set out in our proposal. The first alternative design
used a carry-look-ahead circuit to improve the speed of the ALU and thus the
overall system. It reduced the ALU computation delay by over 60% (from about 24 ns to 9 ns).
The most important part of our alternative design involved the adaptation of a newly
published technology. To accomplish this goal, a dynamic differential logic family called Swing
Limited Logic (SLL) was used based on a paper by Amr M. Fahim [1]. This work was published
in January of 2002 in the IEEE Journal of Solid-State Circuits. Even though it was originally
designed for a 0.35 µm feature size and 3.3 V system, it was successfully adapted and made to
work on our 0.60 µm CMOS process at 3.0 V.
Section II discusses the methodology used to design our 16-bit ALU and the incorporation
of the ALU into the 16-bit RISC CPU. In Section III, the details of our ALU and RISC CPU
designs and their simulation results are presented. The layout of the base case ALU and its
post-layout simulation results are also discussed in Section III.
II. METHODOLOGY
This design had several challenges to it. The first was that none of us in the group had
previous experience in the design of ALUs. This required us to do some basic research to obtain
information on the design of ALUs, the instruction sets often used, accumulators, and control
circuitry. The instruction set for the ALU ended up being determined by taking common
instructions from various 2-bit and 4-bit ALU datasheets.
The architecture of the ALU led to another design decision. There were several ways to
approach creating the ALU. Our group wanted to make use of the CAS unit so that 2’s
complement arithmetic could be performed in one operation rather than several passes
through the ALU. This led to a design in which multiple functions are executed
simultaneously and the desired output is chosen using a mux network.
The basic architecture we ended up adopting for the 16-bit RISC CPU was based on a
DLXS processor from a paper by Martin Gumm [2]. This architecture would take care of the
timing and control aspects of the design but required the complete circuit simulation to be done
digitally. As a result, digital simulation and functional coding became another challenge for our
design. There were also concerns with the idea of producing an entirely digital model of the
system, since that is somewhat outside the scope of this class and would not verify the
analog aspects of our circuit.
This led to the creation of a smaller system that is small enough to simulate using analog
simulation while still verifying the correct operation of the ALU, latches, accumulators, time delays,
and voltage levels of the various signals. The control circuitry for this analog simulation was
emulated using stimulus files. Because the architecture of the DLXS processor required a 16-bit
ALU, registers, and various other logic units, the smaller circuit still had more than 11,000
transistors. This made the analog simulation of the circuit somewhat time consuming.
One of the biggest challenges was to make use of swing limited logic [1]. Even though
this technology has the advantage of low power consumption, it does not scale well with supply
voltage or to an older process technology. After a long struggle with the swing limited logic, we were able
to make it work with our 0.6 µm technology, but the power savings from the limited swing were
diminished by the large transistors we had to employ to make the technology work.
III. DESIGN AND RESULTS
The 16-bit RISC CPU design is broken down into several sections. Each section will
discuss the design and results of that aspect of our system. The three different areas of our
design are the ALU, base case ALU Layout and system architecture.
ALU Design
The Arithmetic Logic Unit (ALU) designed in this project performs eight different
operations (shown in Table 1) on two 16-bit inputs. The design team investigated three
designs. The base case design uses a regular ripple carry
method for the addition operation. The more advanced design uses the carry-look-ahead
method for carry generation in order to speed up the ALU. For the third
design, swing limited differential logic was used with the goal of reducing the power
consumption of the circuit and the power-delay product.
Table 1: Operations performed by ALU
Select:
S2  S1  S0   Operation
0   0   0    A plus B
0   0   1    A minus B
0   1   0    B minus A
0   1   1    A and B
1   0   0    A or B
1   0   1    A xor B
1   1   0    Pass A
1   1   1    Pass B
Base Case Design – Ripple Carry ALU
At the heart of our ALU design is the controlled adder/subtractor (CAS) unit. This unit
can perform A+B, A-B or B-A depending on the control instruction. The CAS is
constructed based on the two’s complement method of subtraction (i.e., A minus B is
the same as A plus the complement of B, plus “1”). The two’s complement method of
subtraction is illustrated below:
Regular Subtraction:
    A:    10010101
  - B:    10001010
          --------
          00001011

Two’s Complement:
    A:    10010101
  + B’:   01110101
          --------
          00001010
  +              1
          --------
          00001011

As one can see from the CAS unit shown in Figure 1(a), the complement of input A or B
is done using two xor gates.
Figure 1: (a) controlled adder/subtractor. (b) 1bit ALU.
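The two's complement subtraction illustrated above can be checked numerically. The following is a minimal Python sketch, not part of the original design; the function name `twos_complement_subtract` and the 8-bit default width are chosen here purely for illustration.

```python
def twos_complement_subtract(a: int, b: int, width: int = 8) -> int:
    """Compute a - b as a + (complement of b) + 1, truncated to `width` bits."""
    mask = (1 << width) - 1
    b_inverted = ~b & mask               # B' : every bit of B flipped
    return (a + b_inverted + 1) & mask   # carry out of the top bit is discarded


a = 0b10010101
b = 0b10001010
print(f"{twos_complement_subtract(a, b):08b}")  # prints 00001011
```

The result matches the regular subtraction, which is why the CAS only needs an adder, input inverters, and a carry-in of 1.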
This design enables the control signals S1 and S0 to control which input bit is
complemented, and the subsequent operation is just like regular addition. The extra
“1” needed for the two’s complement method must also be generated for this operation to be
successful. The extra “1” is generated by ORing the select lines S1 and S0 and using
the result as Ci (the carry in) of the first bit of the 16-bit implementation. This way, if
subtraction is desired, either S1 or S0 must be “1” according to Table 1 and Ci becomes
“1” as required by the operation. If addition is required, the output of the or gate is
“0”. After completing the controlled adder/subtractor unit, the 1-bit ALU was constructed as
shown in Figure 1(b). The 1-bit inputs A and B are fed to all 8 different operation units and the
results are fed to the multiplexer (MUX), which selects the correct result.
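The behavior described above can be sketched as a bit-level Python model. This is an illustrative sketch only: the assignment of S1 to the XOR gate on A and S0 to the XOR gate on B is an assumption inferred from Table 1 (001 is A minus B, 010 is B minus A), and the names `alu_bit` and `alu16` are hypothetical.

```python
def alu_bit(a, b, ci, s2, s1, s0):
    """One ALU bit slice; returns (f, co)."""
    # CAS: XOR gates conditionally complement an input, then a full adder sums.
    x = a ^ s1   # assumed: S1 complements A (needed for B minus A)
    y = b ^ s0   # assumed: S0 complements B (needed for A minus B)
    total = x + y + ci
    cas_sum, co = total & 1, total >> 1
    # All function units evaluate in parallel; the mux picks one by S2 S1 S0.
    candidates = [cas_sum, cas_sum, cas_sum, a & b, a | b, a ^ b, a, b]
    return candidates[(s2 << 2) | (s1 << 1) | s0], co


def alu16(a, b, select, width=16):
    """Ripple `width` bit slices together; returns (F, Co)."""
    s2, s1, s0 = (select >> 2) & 1, (select >> 1) & 1, select & 1
    carry = s1 | s0   # the OR gate supplies the extra "1" for subtraction
    f = 0
    for i in range(width):
        bit, carry = alu_bit((a >> i) & 1, (b >> i) & 1, carry, s2, s1, s0)
        f |= bit << i
    return f, carry


for select in range(8):
    f, co = alu16(5, 3, select)
    print(f"{select:03b} -> F={f:#06x}")
```

For the logic and pass operations the carry output of this model is not meaningful, since the mux ignores the CAS result in those cases.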
After completing the design of a 1-bit ALU unit, the 16-bit ALU is designed simply by
connecting 16 of the 1-bit ALUs in series and wiring all the proper inputs (see Figure 2). In
order to establish a baseline for the performance of the 16-bit ALU circuit, the “Ripple Carry”
technique was used for the construction of the ALU. Even though this design method is easy to
understand, it is very slow compared to more advanced techniques such as a carry-look-ahead
adder/subtractor.
Figure 2: Complete block level schematic of the 16-Bit ALU.
As discussed in the controlled adder/subtractor section, the or gate used in the 16-Bit
ALU simply generates the additional “1” required for the 2’s complement subtraction.
Simulation Results
The performance of the ripple carry ALU is shown in Figure 3. The outputs “F<15:0>”,
carry out “Co” and control signals “S2, S1, S0” are shown. All eight operations of the ALU
were verified and the timing of the ALU was determined from this simulation. According to the
simulation result, the addition/subtraction operation took the longest time, about 24 ns. The
simulation was done using the following two 16-bit inputs:
A = 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1 1
B = 1 0 0 1 0 0 0 0 1 0 0 0 1 1 0 1
As indicated by the dotted line in Figure 3, the time delay for the addition operation grows
linearly with bit position because the carry needs to ripple through all 16 stages before reaching the final
answer. The maximum time delay at output F<15> (the most significant bit) is about 24 ns
for the entire operation (the select line S0 has a period of 60 ns). An example of the A and B
operation (S2S1S0=011) is also shown in Figure 3 for illustration.
Figure 3: Analog simulation result for 16-Bit ALU (Ripple Carry).
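To see why these particular inputs exercise the full carry chain, the addition can be replayed bit by bit. The sketch below is an illustrative Python model, not the simulated circuit; `ripple_add` is a hypothetical helper, and the per-stage figure is only a rough average that assumes the 24 ns is spread evenly over the 16 stages.

```python
A = 0b0110111101110011
B = 0b1001000010001101


def ripple_add(a, b, width=16):
    """Add bit by bit; returns (sum, carry_out, carries), carries[i] = carry out of bit i."""
    carry, total, carries = 0, 0, []
    for i in range(width):
        s = ((a >> i) & 1) + ((b >> i) & 1) + carry
        total |= (s & 1) << i
        carry = s >> 1
        carries.append(carry)
    return total, carry, carries


total, carry_out, carries = ripple_add(A, B)
print(f"F = {total:016b}, Co = {carry_out}")
print("stages producing a carry:", sum(carries))

# Rough average delay per stage, assuming the 24 ns is evenly distributed.
print("approx. delay per stage (ns):", 24 / 16)
```

With these operands bit 0 generates a carry and every higher bit propagates it, so the carry travels through all 16 stages before F<15> settles.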
Alternate Design – Carry-Look-Ahead ALU
To improve the performance of our ALU design, a 4-bit carry-look-ahead (CLA) circuit was
added to the ALU. This addition reduces the carry delay and eliminates the ripple
connection between the individual 1-bit ALU units within each 4-bit group.
Figure 4: (a) 4-bit carry-look-ahead unit. (b) 4-bit carry-look-ahead ALU.
In order to keep the function of the controlled adder/subtractor unit, the CLA unit was
placed after the xor operation on the inputs (see Figure 4(a)). After the CLA unit was
designed, the CLA and four 1-bit ALUs were connected together to form a 4-bit carry-look-ahead
ALU (shown in Figure 4(b)). Then the 16-bit ALU was constructed by connecting four 4-bit
carry-look-ahead ALUs in series (see Figure 5).
Figure 5: Schematic of a 16-Bit Carry-Look-Ahead ALU.
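The structure described above, four 4-bit look-ahead groups connected in series, can be sketched in Python using the textbook generate/propagate equations. This is a behavioral illustration under that assumption, not the gate-level circuit of Figure 4; `cla4` and `add16_cla` are hypothetical names, and the inputs `x` and `y` stand for the operands after the CAS xor stage.

```python
def cla4(x, y, cin):
    """4-bit carry-look-ahead add; returns (sum, carry_out)."""
    g = [((x >> i) & 1) & ((y >> i) & 1) for i in range(4)]  # generate
    p = [((x >> i) & 1) ^ ((y >> i) & 1) for i in range(4)]  # propagate
    c0 = cin
    # Each carry is a flat function of g, p and cin, so no carry waits on another.
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    carries = [c0, c1, c2, c3]
    s = sum((p[i] ^ carries[i]) << i for i in range(4))
    return s, c4


def add16_cla(a, b, cin=0):
    """Four 4-bit CLA groups in series; the carry ripples only between groups."""
    total, carry = 0, cin
    for k in range(4):
        nibble, carry = cla4((a >> (4 * k)) & 0xF, (b >> (4 * k)) & 0xF, carry)
        total |= nibble << (4 * k)
    return total, carry


print(add16_cla(0x6F73, 0x908D))
```

In this arrangement the carry passes through four group boundaries instead of sixteen bit slices, which is consistent with the shorter delay reported for this design.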
Simulation Results
The stimulus file used for the simulation is the same as the base case with a slight
modification to the select-line timing. The outputs are identical to our base case simulation, but
with a much shorter computation delay of 9 ns (see Figure 6).
Figure 6: Analog simulation result for 16-Bit Carry-Look-Ahead ALU.
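The improvement over the base case follows directly from the two simulated delays. A short Python calculation, using only the 24 ns and 9 ns figures reported in this section (the variable names are illustrative):

```python
ripple_delay_ns = 24.0   # ripple carry ALU, base case simulation
cla_delay_ns = 9.0       # carry-look-ahead ALU simulation

reduction = (ripple_delay_ns - cla_delay_ns) / ripple_delay_ns
speedup = ripple_delay_ns / cla_delay_ns
print(f"delay reduction: {reduction:.1%}, speedup: {speedup:.2f}x")
```

This gives a delay reduction of 62.5%, in line with the "over 60%" improvement quoted for the carry-look-ahead design.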
Alternate Design – Swing Limited Logic ALU
The final alternative design of the 16-bit ALU was to implement swing limited logic. A
sample SLL nor gate is shown in Figure 7. Its basic operation is explained as follows. During the
precharge cycle the clock is high, the outputs are tri-stated off, and the nMOS transistor M10
pulls the Comp node to zero. This turns on the two pMOS transistors P1 and P0, which pull
PreY and PreYbar up to Vdd. The bulks of the pMOS transistors P0, P1, P3 and P4 are tied to Vcc, which
for our design was set at 5 V. This increases their threshold voltage so that Comp shuts off
the precharge transistors at a lower voltage.
As the clock ends the precharge cycle and starts to fall low, transistors M5 and M10 in the
inverter are both on for a small period of time. Either one of the nMOS ‘0’ or ‘1’ trees will be
on, allowing that side to discharge. Due to the presence of M5, the low voltage level is only
around 1.8 V. This voltage depends greatly on the W/L ratio of the transistors in the short-circuit
current path (path to ground). At the same time the tri-state output gates also start to turn on,
allowing Y and Ybar to be transmitted to the succeeding gates. Comp will only rise to Vdd-Vtn
as the clock is inverted through the bottom inverter, but with the higher threshold voltage of
the precharge transistors, this is enough to turn them off, so they stop providing a pull-up current to
nodes PreY and PreYbar. This design also allows the clock swing to be reduced to the range 0 to Vdd-Vtn,
which is another energy saving feature. The node Comp can also be used as a clock signal
for the next gate. Unfortunately, the power consumption of this gate when averaged over a 30 ns
period was 37.51 µW versus 11.24 µW for a static CMOS nor gate. However, it does provide
complementary outputs.
Figure 7: SLL nor/or gate.
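To put the two power figures in perspective, the ratio and the energy per 30 ns averaging period can be computed directly. This is a small Python calculation on the reported numbers only; treating average power times the averaging period as the energy per period is the one assumption made here.

```python
sll_power_w = 37.51e-6      # SLL nor gate, averaged over 30 ns
static_power_w = 11.24e-6   # static CMOS nor gate
period_s = 30e-9

ratio = sll_power_w / static_power_w
sll_energy_pj = sll_power_w * period_s * 1e12
static_energy_pj = static_power_w * period_s * 1e12
print(f"SLL / static power ratio: {ratio:.2f}")
print(f"energy per period: SLL {sll_energy_pj:.3f} pJ, static {static_energy_pj:.3f} pJ")
```

The SLL gate draws roughly 3.3 times the power of the static gate in this comparison.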
The paper [1] mentions that, due to the stacked pMOS-nMOS-pMOS structure, this
logic family does not scale well with supply voltage. However, after much effort the swing
limited logic was successfully implemented in our design. The basic gates needed for the
ALU were constructed, and then they were put together to form a CAS (Figure 8(a)) and
eventually the 1-bit ALU (Figure 8(b)).
Figure 8: (a) SLL CAS. (b) SLL 1-bit ALU.
For this ALU, instead of using a typical 2-to-1 mux, a 4-to-2 mux was developed which
consists only of an inverter and pMOS pass gates (Figure 9(a)). The nMOS devices were not necessary since
the signal voltage is in the ~2.0 – 2.8 V range. Finally, a level converter was constructed for the output
of the ALU to convert the SLL voltage levels to static CMOS voltage levels. It basically
consists of a sense circuit that causes the output voltage to be pulled all the way high or
low. This level converter was also sized for speed, achieving a delay of 1.5 ns. It is
shown in Figure 9(b).
Figure 9: (a) SLL mux 4:2. (b) SLL level converter.
The 1-bit ALUs were then connected together to form a 16-bit ALU. The 1-bit ALU
was tested for all possible inputs, to verify its operation using analog simulation. The 16-bit
ALU was also tested to verify proper results for every operation.
Due to the nature of dynamic differential logic, where a succeeding gate needs to wait for
the evaluation cycle of the previous gate before it can switch to evaluate, it takes several gate
delays to get the result for 1 bit. In a 16-bit ripple carry adder this amounts to about 120 ns for the
result to be fully computed in the worst case, with a carry rippling at every bit. While it is
slow, it does use a lower voltage swing for the clock and the input/output signals. This logic could be
much faster if a 16-bit carry-look-ahead unit were implemented.
Simulation
Simulations of the 1-bit ALU are shown below in Figure 10. They include the output after
going through the SLL to CMOS level converter. The figure is annotated to show the state of the
inputs, the operation being performed, and the states of the outputs at various points. It
takes almost 6 ns for the signal to propagate through on an addition or subtraction operation. In
the figure, the Ci bit was set to 1. A 16-bit version of the ALU was simulated as well, with
correct results.
Figure 10: (a) SLL 1-bit ALU Simulation Results (A+B, A-B, B-A, A and B).
Figure 10: (b) SLL 1-bit ALU Simulation Results (A or B, A xor B, Pass A, Pass B).
Base Case ALU Layout
Layout is an important part of the project. We reused some basic gates (such as nor,
nand, buffer9) to reduce the design work. In addition, an xor2 gate, an xor3 gate and
a 3-input nor gate were completed and tested. Minimum size transistors were used to
reduce the area of the logic gates.
In order to verify the design from the perspective of the layout, a fully functioning 1-bit
ALU layout was completed and simulated. The 16-bit layout uses 16 instances of
the 1-bit ALU, cascaded into a 16-bit ALU. Additional buffers were used to improve the drive strength of
the select signals. The 1-bit ALUs are laid out in a square shape,
as is the 16-bit ALU (see Figure 11). All the input and output ports are routed to the
edge of the instance for convenience in cascading and connecting to other parts. Completing
the 1-bit ALU and the 16-bit ALU was challenging since the objective was to
minimize the area while maintaining performance. The following table shows the dimensions of some
layouts.
Table 2: Sizes of various layouts
Part          Height    Width
3-input NOR   15.60u    0.35u
2-input XOR   15.60u    24.75u
3-input XOR   15.45u    49.05u
1-bit ALU     110.1u    99.150u
16-bit ALU    606.3u    497.85u
Figure 11: 16-Bit ALU with ripple carry logic.
Simulations of the 1-bit ALU and the 16-bit ALU layouts were also performed. The
simulation results (see Figure 12) verified the design, but the computation delay is longer
than in the schematic simulation (40 ns for addition). This is possibly due to the extracted parasitic
capacitance and resistance, since there are many wires and transistors in the large layout.
Figure 12: Simulation result from layout of 16-bit ALU.
System Architecture
In this project, the 16-bit RISC CPU was based on the architecture of a DLXS processor.
It was chosen because of its simple instruction set and its easily understandable architecture.
DLXS consists of the controller and the datapath. The controller generates the signals to control
the data flow and the datapath executes all operations on the given data set.
Figure 13: DLXS Architecture (diagram labels: Reset, RW, Controller, Datapath, Phi1, Phi2, address, memory).
Datapath
The schematic of the datapath is shown in Figure 14. The datapath contains all registers, the
ALU and the internal data buses.
Figure 14: Datapath Schematic.
The fundamental operation of the datapath is reading operands from the register file,
operating on them in the ALU, and then writing the result back to the register file or to the
various control registers. Three internal buses are used in the datapath: source bus 1 (S1), source
bus 2 (S2) and the destination bus (Dest). The controller selects the registers that the data is
loaded from and written to. It should be noted that the ALU is the only path between the source buses
and the destination bus. The pass operation of the ALU moves data from a source
bus to the destination bus without any modification.
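The register-transfer behavior described above can be sketched as a small Python model. This is an illustration only: the class `Datapath`, the operation names, and the choice of 8 registers are assumptions made for the example and are not taken from the schematic.

```python
MASK16 = 0xFFFF

# ALU functions keyed by hypothetical operation names.
ALU_OPS = {
    "add":    lambda a, b: a + b,
    "sub":    lambda a, b: a - b,
    "and":    lambda a, b: a & b,
    "or":     lambda a, b: a | b,
    "xor":    lambda a, b: a ^ b,
    "pass_a": lambda a, b: a,
    "pass_b": lambda a, b: b,
}


class Datapath:
    """Register file and ALU joined by buses S1, S2 and Dest."""

    def __init__(self, num_regs=8):          # register count is an assumption
        self.regs = [0] * num_regs

    def transfer(self, rs1, rs2, rd, op):
        s1 = self.regs[rs1]                  # register drives source bus 1
        s2 = self.regs[rs2]                  # register drives source bus 2
        dest = ALU_OPS[op](s1, s2) & MASK16  # the ALU is the only route to Dest
        self.regs[rd] = dest                 # destination register latches Dest
        return dest


dp = Datapath()
dp.regs[1], dp.regs[2] = 5, 3
dp.transfer(1, 2, 3, "add")      # r3 = r1 + r2
dp.transfer(3, 0, 4, "pass_a")   # plain register move, routed through the ALU
print(dp.regs[:5])
```

Because every transfer goes through the ALU, even a simple move needs an ALU operation, which is the role of the two pass functions.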
Controller
The key component of the controller is the finite state machine (FSM), which generates
the sequence of control signals necessary for the data flow in the datapath. Figure 15 below
illustrates some of the outputs of the controller. Sbus_ctrl controls the read enable signals
of the registers connected to the source buses. Dbus_ctrl controls the write enable signals of the
registers on the destination bus. Dbus_ctrl is gated by phi2 while the other control signals are
synchronized with phi1. RS1, RS2 and RD select specific registers in the register file. Taking the
add operation as an example, the sum of the data in RS1 and RS2 is computed by the ALU and is
stored in RD.
Figure 15: Controller structure (diagram labels: FSM, And LOGIC, Instruction register[15:0], Phi1, Phi2, Sbus_ctrl[3:0], Dbus_ctrl[3:0], Alu_ctrl[2:0], Memory_ctrl, RS1 address[2:0], RS2 address[2:0], RD_address[2:0]).
In this project, the controller is implemented in Verilog. We used the Cadence Logic
Verification Tool to verify the logic of the controller and the datapath. To combine the
Verilog modules and the schematic modules, a Verilog file was created for each standard
cell used in the registers and ALU. Taking the Mux21 cell as an example, its Verilog model is:
    module mux21 (Y, A, B, S);
      input A, B, S;
      output Y;
      reg Y;
      // Combinational 2:1 mux: Y follows A when S is 0, otherwise B.
      always @(A or B or S)
        if (S == 1'b0)
          Y = A;
        else
          Y = B;
    endmodule
Apart from the controller, Verilog is also used to simulate the behavior of the memory
where the instructions and data are located. After the controller is reset, it directs the
datapath to fetch an instruction from the memory and advance the program counter to the next
instruction. For simplicity, only a few of the DLXS instructions are implemented, as follows:
Table 3: Sample of DLXS instructions
Opcode  Instruction  Operands                       Operation
0000    add          rs1[3:0], rs2[3:0], rd[3:0]    rd = rs1 + rs2
0001    sub          rs1[3:0], rs2[3:0], rd[3:0]    rd = rs1 - rs2
0010    and          rs1[3:0], rs2[3:0], rd[3:0]    rd = rs1 and rs2
0011    or           rs1[3:0], rs2[3:0], rd[3:0]    rd = rs1 or rs2
0100    xor          rs1[3:0], rs2[3:0], rd[3:0]    rd = rs1 xor rs2
1000    load         rd[3:0], #addr[7:0]            rd = memory[addr]
1001    write        rs, #addr[7:0]                 memory[addr] = rs
1010    goto         #addr[11:0]                    pc = addr
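The instruction formats in Table 3 can be illustrated with a small encoder and decoder in Python. The table gives the field widths but not the bit positions, so the layout used here is an assumption: opcode in bits [15:12], operands packed left to right below it, and addresses in the low bits. The functions `encode` and `decode` are hypothetical helpers.

```python
OPCODES = {
    "add": 0b0000, "sub": 0b0001, "and": 0b0010, "or": 0b0011,
    "xor": 0b0100, "load": 0b1000, "write": 0b1001, "goto": 0b1010,
}
MNEMONICS = {code: name for name, code in OPCODES.items()}


def encode(op, *fields):
    """Pack one instruction into a 16-bit word (assumed field layout)."""
    word = OPCODES[op] << 12
    if op == "goto":
        (addr,) = fields
        return word | (addr & 0xFFF)                      # addr[11:0]
    if op in ("load", "write"):
        reg, addr = fields
        return word | ((reg & 0xF) << 8) | (addr & 0xFF)  # reg[3:0], addr[7:0]
    rs1, rs2, rd = fields
    return word | ((rs1 & 0xF) << 8) | ((rs2 & 0xF) << 4) | (rd & 0xF)


def decode(word):
    """Unpack a 16-bit word into (mnemonic, fields...)."""
    op = MNEMONICS[(word >> 12) & 0xF]
    if op == "goto":
        return op, word & 0xFFF
    if op in ("load", "write"):
        return op, (word >> 8) & 0xF, word & 0xFF
    return op, (word >> 8) & 0xF, (word >> 4) & 0xF, word & 0xF


word = encode("add", 1, 2, 3)
print(f"{word:#06x}", decode(word))
```

Under this assumed layout a 4-bit opcode plus three 4-bit register fields fills the 16-bit instruction word exactly.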
State transition diagram
The execution of a DLXS instruction can be broken into five basic steps: fetch, decode, execute,
memory access and write result. No pipeline was implemented in this structure and each step
may take several clock cycles. The state transitions of the controller FSM are shown below.
States and actions shown in the diagram:
  State: fetch        Action: read memory
  State: decode       Action: save the instruction in IR; decode the instruction; PC <= PC + 2
  State: execute      Action: latch the data on source bus to ALU
  State: write_back   Action: save the data to register C
  State: load_addr    Action: MAR = IR[11:0]
  State: load_data    Action: pass the data in MDR to the destination bus
  State: write_addr   Action: MAR = IR[11:0]
  State: write_data   Action: write enable = ‘1’
  State: loadPC       Action: PC = IR[11:0]
Transition labels shown in the diagram: mem ready, mem not ready, load, write, goto, and “Add, subtract, AND, or, XOR”.
Figure 16: State transition diagram of the controller FSM.
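The controller FSM can be sketched as a next-state function in Python. The state names come from Figure 16, but the exact transitions are partly an assumption: in particular, the paths load_data to write_back and write_data to fetch are inferred from the load dataflow description in this report rather than read from the diagram. `next_state` and `trace` are hypothetical helpers.

```python
ALU_INSTRUCTIONS = {"add", "sub", "and", "or", "xor"}


def next_state(state, instruction=None, mem_ready=True):
    """Return the next FSM state (assumed transition structure)."""
    if state == "fetch":
        return "decode" if mem_ready else "fetch"        # wait for memory
    if state == "decode":
        if instruction in ALU_INSTRUCTIONS:
            return "execute"
        return {"load": "load_addr", "write": "write_addr",
                "goto": "loadPC"}[instruction]
    if state == "execute":
        return "write_back"
    if state == "load_addr":
        return "load_data"
    if state == "load_data":
        return "write_back" if mem_ready else "load_data"  # assumed target
    if state == "write_addr":
        return "write_data"
    if state == "write_data":
        return "fetch" if mem_ready else "write_data"      # assumed target
    if state in ("write_back", "loadPC"):
        return "fetch"
    raise ValueError(f"unknown state: {state}")


def trace(instruction):
    """States visited for one instruction, assuming memory is always ready."""
    states = ["fetch"]
    while True:
        nxt = next_state(states[-1], instruction)
        if nxt == "fetch":
            return states
        states.append(nxt)


for name in ("add", "load", "write", "goto"):
    print(name, "->", trace(name))
```

Each instruction passes through several states in sequence, which matches the statement that there is no pipeline and each step may take several clock cycles.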
The dataflow of the load instruction after the processor is reset is explained below.
When the reset signal goes high, the program counter (PC) and other registers are set to 0.
In the next clock cycle, the controller goes to the fetch state and the PC is connected to the memory
address bus by mux1. The controller waits until the memory is ready and the instruction
is available on the data bus. Then the load instruction is written to the instruction register (IR)
by setting the IR write enable signal. In the next step, the offset in the load instruction is passed
to the memory address register (MAR). Because the ALU is the only path between the source bus
and the destination bus, the ALU operation must be set to the pass function. Then the
controller goes to the load_data state, where the MAR instead of the PC is selected to drive the
memory address bus. When the memory is ready, the data loaded from the memory is
stored in the memory data register (MDR). In the next step, the data in the MDR is passed
through the ALU and saved in register C.
To test the logic of the controller and the datapath, we calculate a sequence of numbers
defined by the recurrence $f_{n+2} = f_{n+1} + f_n$ for $n \geq 1$, with $f_1 = 1$ and $f_2 = 2$.
The sequence of numbers should be (1, 2, 3, 5, 8, 13, 21, 34, ...). The program and
the data are stored in the memory described in Verilog.
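The expected values of this test sequence can be generated with a few lines of Python, which is useful as a reference when checking the simulated register contents. The function `sequence` is a hypothetical helper; the 16-bit wraparound mirrors the word size of the CPU and does not affect the first terms.

```python
def sequence(count, f1=1, f2=2, width=16):
    """First `count` terms where each term is the sum of the previous two."""
    mask = (1 << width) - 1
    values = [f1, f2]
    while len(values) < count:
        values.append((values[-1] + values[-2]) & mask)  # wrap like a 16-bit register
    return values[:count]


terms = sequence(8)
print(terms)                              # [1, 2, 3, 5, 8, 13, 21, 34]
print([f"0x{t:04X}" for t in terms[:6]])  # hexadecimal form, as on a data bus
```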
Figure 17: (a) control signal for the program.
Figure 17: (b) data bus of the datapath.
The control signals and the data bus waveforms are shown in Figure 17. The IR
contains the instruction fetched from the memory. Cout in the data bus waveform is the data
written to register C in the datapath. As shown in the figure, the data sequence of Cout is
0x0001, 0x0002, 0x0003, 0x0005, 0x0008, 0x000D, ..., which is the same as the sequence above.
The schematic of the RISC CPU with the simulated memory block is shown in Figure 18.
Figure 18: 16-Bit RISC CPU with memory block (diagram labels: Controller, Memory, Datapath).
Analog Simulation of the Datapath
To verify that the analog circuit was correct, a “small” analog model was constructed that
consisted of the base 16-bit ALU, the register file, and the registers in between. This “smaller”
system still consisted of over 11,000 transistors. The system had five different styles of registers
composed of D Flip Flops, some with output enables as well as latch enables, and some with
dual outputs. The simulation verifies the correct performance of the ALU and its interaction
with the registers. The result of the simulation is shown in Figure 19 below. It takes 40 ns for the
data to go completely around the loop. The data starts out from some constants at A & B busses.
We are looking at an addition function and specifically at bit 3 in this simulation. BusL1 and
BusL2 hold the data that enters the ALU, and C has the data as it comes out.
Figure 19: Analog Simulation of the data busses of the datapath.
CONCLUSIONS
In this project, our team has completed three alternative designs of a 16-bit ALU for use
in a RISC CPU. The implementation of carry-look-ahead logic in our second ALU
design produced an over 60% reduction in computation delay relative to the base case. All three
designs of our ALU performed 2’s complement subtraction in one pass through the ALU. The
most significant accomplishment of the design project is the adaptation of the swing limited logic
in our ALU design. This team was able to make the logic work for the technology file given
even though this logic was designed for a much more advanced technology. We were even able to
shorten the clock cycle for precharge and evaluate to 3 ns, which is very close to the reference
paper. However, this design did not produce the power savings promised in the reference paper,
probably due to the larger transistors that we had to use in building the basic gates. The 16-bit
CPU was then constructed from all of the various parts built in this project including the ALU,
registers, buffers, controller, memory and other various pieces. The 16-bit RISC CPU was simulated
and verified to be functional as designed using the Verilog simulator in Cadence.
REFERENCES:
[1] Amr M. Fahim, “Low-Power High-Performance Arithmetic Circuits and Architectures”, IEEE Journal of Solid-State Circuits, Vol. 37, No. 1, pp. 90-94, January 2002.
[2] Martin Gumm, “VHDL-Modeling and Synthesis of the DLXS RISC Processor”, University of Stuttgart, 1995.