CS35101 Computer Architecture Spring 2006 Week 8 P Durand (durand) [Adapted from MJI (mji)] [Adapted...
-
Upload
lisa-robinson -
Category
Documents
-
view
222 -
download
8
Transcript of CS35101 Computer Architecture Spring 2006 Week 8 P Durand (durand) [Adapted from MJI (mji)] [Adapted...
CS35101Computer
ArchitectureSpring 2006
Week 8
P Durand (www.cs.kent.edu/~durand)
[Adapted from MJI (www.cse.psu.edu/~mji)]
[Adapted from Dave Patterson’s UCB CS152 slides]
Head’s Up This week’s material
MIPS logic and multiply instructions- Reading assignment – PH 3.4
MIPS ALU design- Reading assignment – PH 3.5
Reminders
Next week’s material Building a MIPS datapath
- Reading assignment – PH 5.1-5.2
Review: MIPS Arithmetic Instructions
R-type:
I-Type:
31 25 20 15 5 0
op Rs Rt Rd funct
op Rs Rt Immed 16
Type op funct
ADD 00 100000
ADDU 00 100001
SUB 00 100010
SUBU 00 100011
AND 00 100100
OR 00 100101
XOR 00 100110
NOR 00 100111
Type op funct
00 101000
00 101001
SLT 00 101010
SLTU 00 101011
00 101100
0 add
1 addu
2 sub
3 subu
4 and
5 or
6 xor
7 nor
a slt
b sltu
expand immediates to 32 bits before ALU10 operations so can encode in 4 bits
32
32
32
m (operation)
result
A
B
ALU
4
zeroovf
11
Review: A 32-bit Adder/Subtractor
1-bit FA S0
c0=carry_in
c1
1-bit FA S1
c2
1-bit FA S2
c3
c32=carry_out
1-bit FA S31
c31
. .
.
Built out of 32 full adders (FAs) A0
B0
A1
B1
A2
B2
A31
B31
add/subt
1 bit FA
A
BS
carry_in
carry_out
S = A xor B xor carry_in
carry_out = AB v Acarry_in v Bcarry_in (majority function)
Small but slow!
Minimal Implementation of a Full Adder
architecture concurrent_behavior of full_adder is
signal t1, t2, t3, t4, t5: std_logic;
begin
t1 <= not A after 1 ns;
t2 <= not cin after 1 ns;
t4 <= not((A or cin) and B) after 2 ns;
t3 <= not((t1 or t2) and (A or cin)) after 2 ns;
t5 <= t3 nand B after 2 ns;
S <= not((B or t3) and t5) after 2 ns;
cout <= not(t1 or t2) and t4) after 2 ns;
end concurrent_behavior;
Can you create the equivalent schematic? Can you determine worst case delay (the worst case timing path through the circuit)?
Gate library: inverters, 2-input nands, or-and-inverters
Logic Operations Logic operations operate on individual bits of the
operand. $t2 = 0…0 0000 1101 0000
$t1 = 0…0 0011 1100 0000
and $t0, $t1, $t2 $t0 =
or $t0, $t1 $t2 $t0 =
xor $t0, $t1, $t2 $t0 =
nor $t0, $t1, $t2 $t0 =
How do we expand our FA design to handle the logic operations - and, or, xor, nor ?
0…0 0000 1100 0000
0…0 0011 1101 0000
0…0 0011 0001 0000
1…1 1100 0010 1111
A Simple ALU Cell
1-bit FA
carry_in
carry_out
A
B
add/subt
add/subt
result
op
An Alternative ALU Cell
1-bit FA
carry_in
s1
s2
s0
result
carry_out
A
B
The Alternative ALU Cell’s Control Codes
s2 s1 s0 c_in result function
0 0 0 0 A transfer A
0 0 0 1 A + 1 increment A
0 0 1 0 A + B add
0 0 1 1 A + B + 1 add with carry
0 1 0 0 A – B – 1 subt with borrow
0 1 0 1 A – B subtract
0 1 1 0 A – 1 decrement A
0 1 1 1 A transfer A
1 0 0 x A or B or
1 0 1 x A xor B xor
1 1 0 x A and B and
1 1 1 x !A complement A
Need to support the set-on-less-than instruction (slt)
remember: slt is an arithmetic instruction
produces a 1 if rs < rt and 0 otherwise
use subtraction: (a - b) < 0 implies a < b
Need to support test for equality (beq)
use subtraction: (a - b) = 0 implies a = b
Need to add the overflow detection hardware
Tailoring the ALU to the MIPS ISA
Modifying the ALU Cell for slt
1-bit FA
A
B
result
carry_in
carry_out
add/subt op
add/subt
less
Modifying the ALU for slt
0
0set
First perform a subtraction
Make the result 1 if the subtraction yields a negative result
Make the result 0 if the subtraction yields a positive result
A1
B1
A0
B0
A31
B31
+
result1
less
+
result0
less
+
result31
less
. . .tie the most significant sum bit (sign bit) to the low order less input
Modifying the ALU for Zero
+
A1
B1
result1
less
+
A0
B0
result0
less
+
A31
B31
result31
less
. . .
0
0
set
First perform subtraction
Insert additional logic to detect when all result bits are zero
zero
. . .
add/subtop
Note zero is a 1 when result is all zeros
Review: Overflow Detection Overflow: the result is too large to represent in the
number of bits allocated
Overflow occurs when adding two positives yields a negative or, adding two negatives gives a positive or, subtract a negative from a positive gives a negative or, subtract a positive from a negative gives a positive
On your own: Prove you can detect overflow by: Carry into MSB xor Carry out of MSB
1
1
1 10
1
0
1
1
0
0 1 1 1
0 0 1 1+
7
3
0
1
– 6
1 1 0 0
1 0 1 1+
–4
– 5
71
0
Modifying the ALU for Overflow
+
A1
B1
result1
less
+
A0
B0
result0
less
+
A31
B31
result31
less
. . .
0
0
set
Modify the most significant cell to determine overflow output setting
Disable overflow bit setting for unsigned arithmetic
zero
. . .
add/subtop
overflow
But What about Performance? Critical path of n-bit ripple-carry adder is n*CP
Design trick – throw hardware at it (Carry Lookahead)
A0
B0
1-bitALU
Result0
CarryIn0
CarryOut0
A1
B1
1-bitALU
Result1
CarryIn1
CarryOut1
A2
B2
1-bitALU
Result2
CarryIn2
CarryOut2
A3
B3
1-bitALU
Result3
CarryIn3
CarryOut3
Shift Operations Also need operations to pack and unpack 8-bit
characters into 32-bit words
Shifts move all the bits in a word left or right
sll $t2, $s0, 8 #$t2 = $s0 << 8 bits
srl $t2, $s0, 8 #$t2 = $s0 >> 8 bits
Such shifts are logical because they fill with zeros
op rs rt rd shamt funct
000000 00000 10000 01010 01000 000000
000000 00000 10000 01010 01000 000010
Shift Operations, con’t
An arithmetic shift (sra) maintain the arithmetic correctness of the shifted value (i.e., a number shifted right one bit should be ½ of its original value; a number shifted left should be 2 times its original value)
so sra uses the most significant bit (sign bit) as the bit shifted in
note that there is no need for a sla when using two’s complement number representation
sra $t2, $s0, 8 #$t2 = $s0 >> 8 bits
The shift operation is implemented by hardware (usually a barrel shifter) outside the ALU
000000 00000 10000 01010 01000 000011
Multiply
Binary multiplication is just a bunch of right shifts and adds
multiplicand
multiplier
partialproductarray
double precision product
n
2n
ncan be formed in parallel and added in parallel for faster multiplication
More complicated than addition accomplished via shifting and addition
0010 (multiplicand) x_1011 (multiplier)
0010 0010 (partial product
0000 array) 0010 00010110 (product)
Double precision product produced
More time and more area to compute
Multiplication
Multiply produces a double precision product
mult $s0, $s1 # hi||lo = $s0 * $s1
Low-order word of the product is left in processor register lo and the high-order word is left in register hi
Instructions mfhi rd and mflo rd are provided to move the product to (user accessible) registers in the register file
MIPS Multiply Instruction
op rs rt rd shamt funct
Multiplies are done by fast, dedicated hardware and are much more complex (and slower) than adders
Hardware dividers are even more complex and even slower; ditto for hardware square root
mult $s0, $s1 # hi||lo = $s0 * $s1
Low-order word of the product is left in processor register lo and the high-order word is left in register hi
Instructions mfhi rd and mflo rd are provided to move the product to (user accessible) registers in the register file
MIPS Multiply Instruction
op rs rt rd shamt funct
000000 10000 10001 00000 00000 011000
Multiplication: Implementation
Datapath Control
MultiplicandShift left
64 bits
64-bit ALU
ProductWrite
64 bits
Control test
MultiplierShift right
32 bits
32nd repetition?
1a. Add multiplicand to product and
place the result in Product register
Multiplier0 = 01. Test
Multiplier0
Start
Multiplier0 = 1
2. Shift the Multiplicand register left 1 bit
3. Shift the Multiplier register right 1 bit
No: < 32 repetitions
Yes: 32 repetitions
Done
Final Version
Multiplicand
32 bits
32-bit ALU
ProductWrite
64 bits
Controltest
Shift right
32nd repetition?
Product0 = 01. Test
Product0
Start
Product0 = 1
3. Shift the Product register right 1 bit
No: < 32 repetitions
Yes: 32 repetitions
Done
What goes here?
•Multiplier starts in right half of product
Division
Division is just a bunch of quotient digit guesses and left shifts and subtracts
dividenddivisor
partialremainderarray
quotientnn
remainder
n
0 0 0
0
0
0
Divide generates the reminder in hi and the quotient in lo
div $s0, $s1 # lo = $s0 / $s1
# hi = $s0 mod $s1
Instructions mfhi rd and mflo rd are provided to move the quotient and reminder to (user accessible) registers in the register file
MIPS Divide Instruction
As with multiply, divide ignores overflow so software must determine if the quotient is too large. Software must also check the divisor to avoid division by 0.
op rs rt rd shamt funct
Division: Implementation
Division Implementation
Integer Division – Example 1
Improved Division Hardware
Note that the divisor register, the alu, and the quotient register are 32 bits wide. The remainder register is still 64 bits. The quotient register is combined with the right half of the remainder register.
Integer Division – Example 3
Floating Point (a brief look)
We need a way to represent numbers with fractions, e.g., 3.1416
very small numbers, e.g., .000000001
very large numbers, e.g., 3.15576 109
Representation: sign, exponent, significand: (–1)sign significand 2exponent
more bits for significand gives more accuracy
more bits for exponent increases range
IEEE 754 floating point standard: single precision: 8 bit exponent, 23 bit significand
double precision: 11 bit exponent, 52 bit significand
IEEE 754 floating-point standard
Leading “1” bit of significand is implicit
Exponent is “biased” to make sorting easier all 0s is smallest exponent all 1s is largest bias of 127 for single precision and 1023 for double precision summary: (–1)sign significand) 2exponent – bias
Example:
decimal: -.75 = - ( ½ + ¼ ) binary: -.11 = -1.1 x 2-1
floating point: exponent = 126 = 01111110
IEEE single precision: 10111111010000000000000000000000
Floating Point Complexities
Operations are somewhat more complicated (see text)
In addition to overflow we can have “underflow”
Accuracy can be a big problem
IEEE 754 keeps two extra bits, guard and round
four rounding modes
positive divided by zero yields “infinity”
zero divide by zero yields “not a number”
other complexities
Implementing the standard can be tricky
Not using the standard can be even worse see text for description of 80x86 and Pentium bug!
Representing Big (and Small) Numbers What if we want to encode the approx. age of the earth?
4,600,000,000 or 4.6 x 109
or the weight in kg of one a.m.u. (atomic mass unit) 0.0000000000000000000000000166 or 1.6 x 10-27
There is no way we can encode either of the above in a 32-bit integer.
Floating point representation (-1)sign x F x 2E
Still have to fit everything in 32 bits (single precision)
s E (exponent) F (fraction)1 bit 8 bits 23 bits
The base (2, not 10) is hardwired in the design of the FPALU More bits in the fraction (F) or the exponent (E) is a trade-off
between precision (accuracy of the number) and range (size of the number)
IEEE 754 FP Standard Encoding
Most (all?) computers these days conform to the IEEE 754 floating point standard (-1)sign x (1+F) x 2E-bias
Formats for both single and double precision F is stored in normalized form where the msb in the fraction
is 1 (so there is no need to store it!) – called the hidden bit To simplify sorting FP numbers, E comes before F in the
word and E is represented in excess (biased) notation
Single Precision Double Precision Object Represented
E (8) F (23) E (11) F (52)
0 0 0 0 true zero (0)
0 nonzero 0 nonzero ± denormalized number
± 1-254 anything ± 1-2046 anything ± floating point number
± 255 0 ± 2047 0 ± infinity
255 nonzero 2047 nonzero not a number (NaN)
Floating Point Addition
Addition (and subtraction)
(F1 2E1) + (F2 2E2) = F3 2E3
Step 1: Restore the hidden bit in F1 and in F2 Step 1: Align fractions by right shifting F2 by E1 - E2 positions
(assuming E1 E2) keeping track of (three of) the bits shifted out in a round bit, a guard bit, and a sticky bit
Step 2: Add the resulting F2 to F1 to form F3 Step 3: Normalize F3 (so it is in the form 1.XXXXX …)
- If F1 and F2 have the same sign F3 [1,4) 1 bit right shift F3 and increment E3
- If F1 and F2 have different signs F3 may require many left shifts each time decrementing E3
Step 4: Round F3 and possibly normalize F3 again Step 5: Rehide the most significant bit of F3 before storing the
result
CSE331 W08.42 Irwin Fall 2001 PSU
Floating point addition
Still normalized?
4. Round the significand to the appropriate
number of bits
YesOverflow or
underflow?
Start
No
Yes
Done
1. Compare the exponents of the two numbers.
Shift the smaller number to the right until its
exponent would match the larger exponent
2. Add the significands
3. Normalize the sum, either shifting right and
incrementing the exponent or shifting left
and decrementing the exponent
No Exception
Small ALU
Exponentdifference
Control
ExponentSign Fraction
Big ALU
ExponentSign Fraction
0 1 0 1 0 1
Shift right
0 1 0 1
Increment ordecrement
Shift left or right
Rounding hardware
ExponentSign Fraction
MIPS Floating Point Instructions
MIPS has a separate Floating Point Register File ($f0, $f1, …, $f31) (whose registers are used in pairs for double precision values) with special instructions to load to and store from them
lwcl $f1,54($s2) #$f1 = Memory[$s2+54]
swcl $f1,58($s4) #Memory[$s4+58] = $f1
And supports IEEE 754 single
add.s $f2,$f4,$f6 #$f2 = $f4 + $f6
and double precision operations
add.d $f2,$f4,$f6 #$f2||$f3 =$f4||$f5 +
$f6||$f7
similarly for sub.s, sub.d, mul.s, mul.d, div.s, div.d
MIPS Floating Point Instructions, Con’t
And floating point single precision comparison operations
c.x.s $f2,$f4 #if($f2 < $f4) cond=1;else cond=0
where x may be eq, neq, lt, le, gt, ge
and branch operations
bclt 25 #if(cond==1)go to PC+4+25
bclf 25 #if(cond==0)go to PC+4+25
And double precision comparison operations
c.x.d $f2,$f4 #$f2||$f3 < $f4||$f5 cond=1; else
cond=0
Control Flow for Floating Point Multiplication
Review: MIPS ISA, so farCategory Instr Op Code Example Meaning
Arithmetic
(R & I format)
add 0 and 32 add $s1, $s2, $s3 $s1 = $s2 + $s3
add unsigned 0 and 33 addu $s1, $s2, $s3 $s1 = $s2 + $s3
subtract 0 and 34 sub $s1, $s2, $s3 $s1 = $s2 - $s3
subt unsigned 0 and 35 subu $s1, $s2, $s3 $s1 = $s2 - $s3
add immediate 8 addi $s1, $s2, 6 $s1 = $s2 + 6
add imm. unsigned 9 addiu $s1, $s2, 6 $s1 = $s2 + 6
multiply 0 and 24 mult $s1, $s2 hi || lo = $s1 * $s2
multiply unsigned 0 and 25 multu $s1, $s2 hi || lo = $s1 * $s2
divide 0 and 26 div $s1, $s2 lo = $s1/$s2, rem. in hi
divide unsigned 0 and 27 divu $s1, $s2 lo = $s1/$s2, rem. in hi
Logical
(R & I format)
and 0 and 36 and $s1, $s2, $s3 $s1 = $s2 & $s3
or 0 and 37 or $s1, $s2, $s3 $s1 = $s2 | $s3
xor 0 and 38 xor $s1, $s2, $s3 $s1 = $s2 xor $s3
nor 0 and 39 nor $s1, $s3, $s3 $s1 = !($s2 | $s2)
and immediate 12 andi $s1, $s2, 6 $s1 = $s2 & 6
or immediate 13 ori $s1, $s2, 6 $s1 = $s2 | 6
xor immediate 14 xori $s1, $s2, 6 $s1 = $s2 xor 6
Review: MIPS ISA, so far con’tCategory Instr Op Code Example Meaning
Shift
(R format)
sll 0 and 0 sll $s1, $s2, 4 $s1 = $s2 << 4
srl 0 and 2 srl $s1, $s2, 4 $s1 = $s2 >> 4
sra 0 and 3 sra $s1, $s2, 4 $s1 = $s2 >> 4
Data Transfer
(I format)
load word 35 lw $s1, 24($s2) $s1 = Memory($s2+24)
store word 43 sw $s1, 24($s2) Memory($s2+24) = $s1
load byte 32 lb $s1, 25($s2) $s1 = Memory($s2+25)
load byte unsigned 36 lbu $s1, 25($s2) $s1 = Memory($s2+25)
store byte 40 sb $s1, 25($s2) Memory($s2+25) = $s1
load upper imm 15 lui $s1, 6 $s1 = 6 * 216
move from hi 0 and 16 mfhi $s1 $s1 = hi
move to hi 0 and 17 mthi $s1 hi = $s1
move from lo 0 and 18 mflo $s1 $s1 = lo
move to lo 0 and 19 mtlo $s1 lo = $s1
Review: MIPS ISA, so far con’tCategory Instr Op Code Example Meaning
Cond. Branch
(I & R format)
br on equal 4 beq $s1, $s2, L if ($s1==$s2) go to L
br on not equal 5 bne $s1, $s2, L if ($s1 !=$s2) go to L
set on less than 0 and 42 slt $s1, $s2, $s3 if ($s2<$s3) $s1=1 else $s1=0
set on less than unsigned
0 and 43 sltu $s1, $s2, $s3
if ($s2<$s3) $s1=1 else $s1=0
set on less than immediate
10 slti $s1, $s2, 6 if ($s2<6) $s1=1 else $s1=0
set on less than imm. unsigned
11 sltiu $s1, $s2, 6 if ($s2<6) $s1=1 else $s1=0
Uncond. Jump (J & R format)
jump 2 j 2500 go to 10000
jump and link 3 jal 2500 go to 10000; $ra=PC+4
jump register 0 and 8 jr $s1 go to $s1
jump and link reg 0 and 9 jalr $s1, $s2 go to $s1, $s2=PC+4