1
ECE 371 Microprocessors
Chapter 2
Microprocessors: A High Level View
Herbert G. Mayer, PSU
Status 10/13/2015
For use at CCUT Fall 2015
Some material inherited from Eric Krause @ PSU MS ECE
2
Syllabus
Introduction
Block Diagrams
Simple µP Operation
Arithmetic Logic Unit ALU
BCD Arithmetic
Instruction Set Architecture (ISA)
Iron Law of µP Performance
Amdahl’s Law
Registers and Score Board
Bibliography
3
Introduction
A microprocessor (µP) is functionally similar to an old-fashioned mainframe (MF) computer, and even more so to a typical desktop (DT) computer
Yet there are notable differences! Like a mainframe (MF) and a desktop (DT), a µP:
Has a defined instruction set, named ISA, for Instruction Set Architecture
Has a defined data width, e.g. 8-bit bytes, 60-bit words, etc.
Has a defined address range, typically 2^32 or 2^64
Includes an arithmetic unit for integer, floating-point, BCD, and bit operations
Includes a memory subsystem, plus several levels of caches
Includes an IO subsystem and has other peripheral devices
4
Introduction
Different from an MF computer or DT, a typical µP:
Has a smaller power envelope, in order to be usable in laptops
Uses a lower main frequency, in order to generate less heat, also enabling laptop use with long battery life
Manages power consumption and heat generation through frequency variation and module shut-down
Integrates some peripheral devices onto the same silicon, again saving electrical power, saving board space, plus gaining some speed
Provides a smaller number of duplicated arithmetic logic units (ALUs) that compute data; as a result, some computations, given identical clock frequencies, run more slowly on a µP due to the lower degree of parallel instruction execution
Must fit into a small volume of space, e.g. into a laptop; as a result, there exist fewer options for cooling the actual CPUs
5
Introduction
A µP CPU performs arithmetic-logical operations
Memory holds information, has a path to/from CPU via main bus
How does the µP interact with the real world?
IO ports connect external peripheral devices with HW registers ri, and allow ri to access memory
Buses connect components, e.g. data bus, address bus, and specific enabling bits
6
Introduction
IO ports connect a variety of IO devices
Could be a simple latch to read/write
Or a separate embedded system, e.g. an Ethernet controller
Common: General Purpose IO, or GPIO, which exposes raw bits
IO controller and memory controller together can become as complex as the CPU
E.g. on a motherboard the CPU chip, generally the largest component, is surrounded by other, good-sized chips
Trend is to relocate some of their functions onto the same die as the CPU
7
Block Diagram of Itanium µP
8
Actual Photo of Itanium2 µP
9
Block Diagram of AMD Athlon
10
Block Diagram of Compaq Alpha
11
Die Image of Compaq Alpha
12
Simple µP Operation
We’ll analyze a hypothetical, abstract µP operation
In effect the simplest generic operation is:
1. Read numeric input from a keyboard; here we read an integer from the keyboard
2. Perform an operation on the read operand: add the constant integer 7
3. Display the newly computed result
13
Simple µP Operation, Step 1
1. Read numeric (e.g. integer) input data from keyboard, port 5:
µP sends memory address for Port Input Instruction on the address bus
µP sends memory read control signal
Memory sends machine code for Port Read to µP
µP decodes instruction, sends out port 5 on address bus
µP sends control signal for Port Read
Input port sends µP one byte from port 5, by placing it on the data bus
14
Simple µP Operation, Step 2
2. Add integer constant 7 to the data read from port 5:
µP sends memory address for Add instruction on address bus
µP sends control signal for Read on control bus
Memory returns machine code for Add
µP adds integer constant 7 to the value previously read from input port 5
15
Simple µP Operation, Step 3
3. Display newly computed integer value:
µP sends memory address for Port Output instruction on address bus
µP sends memory read control signal
Memory returns opcode (machine code) for Port Output instruction
µP decodes instruction, sends output port address (0x7) on address bus
µP sends sum from register on data bus
µP sends port write signal on control bus
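The three steps above can be sketched as a toy fetch-decode-execute loop. The one-accumulator ISA below (IN, ADDI, OUT) is invented for illustration and does not correspond to any real µP:

```python
# Minimal sketch of the three-step program, on a made-up one-accumulator ISA.
def run(program, in_ports, out_ports):
    acc = 0
    pc = 0
    while pc < len(program):
        op, arg = program[pc]          # fetch + decode
        pc += 1
        if op == "IN":                 # read one byte from an input port
            acc = in_ports[arg]
        elif op == "ADDI":             # add an immediate constant
            acc = (acc + arg) & 0xFF
        elif op == "OUT":              # write the result to an output port
            out_ports[arg] = acc
    return acc

# Step 1: read from port 5; Step 2: add 7; Step 3: write to port 0x7
out = {}
run([("IN", 5), ("ADDI", 7), ("OUT", 0x7)], {5: 35}, out)
# out[0x7] now holds 35 + 7 = 42
```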
16
Arithmetic Logic Unit ALU
A full adder produces COUT and SUM from three 1-bit inputs: A, B, and CIN
Built just from logic gates
What does this do? See result below right:
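The full adder described above can be sketched directly from its gate equations; a minimal model using Python’s bitwise operators on 0/1 values:

```python
# Gate-level sketch of a 1-bit full adder, using only AND/OR/XOR gates.
def full_adder(a, b, cin):
    s1 = a ^ b                     # half-adder sum of A and B
    sum_out = s1 ^ cin             # final SUM folds in the carry-in
    cout = (a & b) | (s1 & cin)    # carry out of either stage
    return cout, sum_out

# Exhaustive check against integer addition: a + b + cin == 2*cout + sum
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            cout, s = full_adder(a, b, cin)
            assert a + b + cin == 2 * cout + s
```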
17
Arithmetic Logic Unit ALU
18
Arithmetic Logic Unit ALU
This simple ALU performs any one of 16 arithmetic functions on two 4-bit words and a carry in
And what determines the function it performs?
Selection plus MODE!
We are going to program this device to perform a function on the two 4-bit inputs and produce a result on the output pins
19
Arithmetic Logic Unit ALU
The select inputs S3 S2 S1 S0 take all 16 combinations, LLLL through HHHH, one per ALU function

Basic Format:
A = A3 A2 A1 A0
B = B3 B2 B1 B0
F = F3 F2 F1 F0

Worked examples, with A = 0011 and B = 0101:
Logic OR:  F = A + B = 0111
Logic AND: F = A * B = 0001
Logic XOR: F = A ⊕ B = 0110
20
Arithmetic Logic Unit ALU
Bitwise OR, meaning fi = ai + bi, e.g. f0 = a0 + b0, etc., as if there were a dual-input OR gate for each bit of the 2 inputs
ALU can perform any of these functions based on how it is programmed!
Thus we have a device usable in different ways based on program
That is a fundamental part of a microprocessor
How do we program it? With an operational code (SELECT lines). We call this the OpCode
We add 7; but how can this be done? Use the OpCode for Add, i.e. a binary value applied to the select code, to select an operation
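A minimal sketch of such a select-code-driven ALU; the numeric OpCode values below are invented for illustration and do not match the 74181’s actual S3..S0 encoding:

```python
# Sketch of a 4-bit ALU driven by a select code (OpCode).
MASK4 = 0xF

def alu(select, a, b, cin=0):
    if select == 0:                     # logic OR
        return (a | b) & MASK4, 0
    if select == 1:                     # logic AND
        return (a & b) & MASK4, 0
    if select == 2:                     # logic XOR
        return (a ^ b) & MASK4, 0
    if select == 3:                     # arithmetic ADD with carry in/out
        total = a + b + cin
        return total & MASK4, total >> 4
    raise ValueError("unknown select code")

# Same inputs, different OpCode, different function:
a, b = 0b0011, 0b0101
# alu(0, a, b) → (0b0111, 0); alu(1, a, b) → (0b0001, 0); alu(2, a, b) → (0b0110, 0)
```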
21
Binary Coded Decimal Arithmetic
Binary Coded Decimal is a machine-internal data type
BCD uses 4 bits per decimal digit
But computes in binary, to re-use the integer ALU part of the processor
Since only 10 of the 16 bit patterns are used, it can become necessary to add 6 to a computed decimal digit
I.e. it is necessary to add 6 to such a group of 4 bits whenever the raw sum exceeds 9 or produces a carry
The raw binary addition would be correct only in binary, and incorrect in BCD
Adding 6 per BCD digit is that needed correction
22
Binary Coded Decimal Arithmetic
With correction, BCD ALU part works like integer add part, with the additional step of adding 6, but the adder is already implemented in HW
Only a constant storage for decimal 6 is needed
Correction algorithm: given 2 BCD digits A and B:
if sum( A, B ) >= 1010₂ or carry( A, B ) = 1 then
    sum = add( sum( A, B ), 0110₂ );
end if;
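The correction algorithm above, sketched in Python for a single BCD digit (the function name and interface are illustrative):

```python
# Add two BCD digits in plain binary, then add 6 (0110) whenever the raw
# result leaves the valid 0..9 range, exactly as the correction rule states.
def bcd_add_digit(a, b, cin=0):
    assert 0 <= a <= 9 and 0 <= b <= 9
    raw = a + b + cin            # plain binary add, as the integer ALU does
    if raw >= 0b1010:            # out of BCD range (covers the carry case too)
        raw += 0b0110            # the +6 correction
    return raw >> 4, raw & 0xF   # (decimal carry, corrected BCD digit)

# 7 + 5 = 12 decimal → digit 2 with carry 1
# bcd_add_digit(7, 5) → (1, 2)
```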
23
Binary Coded Decimal Arithmetic
24
Binary Coded Decimal Arithmetic
25
Binary Coded Decimal Arithmetic
26
BCD Correction Circuit
27
BCD Correction Circuit
ALU is in the center of the above diagram
Register File is an ordered sequence of flip-flops that can be accessed collectively
Core of the algorithm is to do 2 adds
The /4 means: this is a 4-bit line
MuxB will either pass the second operand B, or the constant 0000, or the correction constant 0110₂
It has 3 4-bit input options on it
Registers in front of the mux hold values so they are stable for the mux
Output register holds the value of the first addition, usable for the second addition
28
BCD Correction Circuit
To read the 2 operands from the reg file, send them into the A/B registers and do the first addition
Then do the second addition: it takes the result of the first addition, and MUX B switches to add 6 or 0
How do we know which? By inspecting the flag register
In this simple case, one bit stores carry information
Output register result goes into the correction detection logic:
It decides whether to add 0 or to add 6
Combinational logic: if Carry, or if one of the 6 invalid digit codes (1010₂ .. 1111₂) is detected, assert the FIX signal that goes to the controller
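That detection logic can be sketched as one boolean expression over the carry and the high sum bits; the signal names below are illustrative, assuming the six invalid codes are 1010₂..1111₂:

```python
# Sketch of the combinational FIX detection: assert FIX when the 4-bit raw
# sum is one of the six invalid BCD codes (1010..1111) or the adder carried.
def fix_needed(cout, s3, s2, s1, s0):
    # sum >= 1010 binary  ⟺  s3·s2 + s3·s1  (s0 is a don't-care here)
    out_of_range = s3 & (s2 | s1)
    return cout | out_of_range

# e.g. raw sum 1100 (12): fix_needed(0, 1, 1, 0, 0) → 1
```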
29
BCD Correction Circuit
1. Select desired operand registers in register file
2. Transfer operand A from register file to A operand register
3. Transfer operand B from register file to B operand register
4. Set MUX A to select A for ALU
5. Set MUX B to select B for ALU
6. Apply ADD code to ALU select inputs
7. Transfer first sum to output of ALU Output register
8. Transfer COUT and other flags to output of Status register
9. Set MUX A to select temporary sum for the ALU A input
10. Set MUX B to select correction 0110 or 0000, based on 1st result
11. Transfer final sum to ALU output register
12. Select register in register file to write final result
13. Write result to selected register in register file
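The 13 steps can be mirrored in a toy simulation, assuming the datapath described above (register file, A/B registers, ALU, status register, correction mux); all Python names are illustrative:

```python
# Sketch simulating the 13-step controller sequence on a toy datapath.
def bcd_controller(regfile, src_a, src_b, dst):
    a_reg = regfile[src_a]               # steps 1-2: operand A → A register
    b_reg = regfile[src_b]               # step 3: operand B → B register
    raw = a_reg + b_reg                  # steps 4-6: MUXes pass A and B, ALU ADDs
    out_reg = raw & 0xF                  # step 7: latch ALU output register
    cout = raw >> 4                      # step 8: latch COUT into status register
    # steps 9-10: MUX A ← temporary sum; MUX B ← 0110 or 0000, per FIX logic
    fix = 0b0110 if (cout or out_reg >= 0b1010) else 0b0000
    raw2 = out_reg + fix                 # step 11: second addition
    regfile[dst] = raw2 & 0xF            # steps 12-13: write digit back
    return cout | (raw2 >> 4)            # final carry, for a multi-digit add

rf = {0: 7, 1: 5, 2: 0}
carry = bcd_controller(rf, 0, 1, 2)      # 7 + 5 → digit 2 in rf[2], carry 1
```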
30
Instruction Set Architecture (ISA)
ISA is the boundary between Software (SW) and Hardware (HW)
Specifies the logical machine that is visible to the programmer & compiler
Is the functional specification for processor designers
That boundary is sometimes a very low-level piece of system software that handles exceptions, interrupts, and HW-specific services
Could fall into the domain of the OS
31
Instruction Set Architecture (ISA)
Specified by an ISA:
Operations: what to perform and in which order
Temporary Operand Storage in the CPU: accumulator, stack, registers
Note that stack can be word-sized, even bit-sized (design of successor for NCR’s Century architecture of the 1970s)
Number of operands per instruction
Operand location: where and how to specify/locate the operands
Type and size of operands
Instruction Encoding in binary
32
Instruction Set Architecture
ISA: Dynamic Static Interface (DSI)
33
Iron Law of µP Performance
Clock rate doesn’t count, bus width doesn’t count, the number of registers and operations executed in parallel doesn’t count!
What counts is how long it takes for my computational task to complete. That time is of the essence of computing!
If a MIPS-based solution runs at 1 GHz and completes a program X in 2 minutes, while an Intel Pentium® 4–based solution runs at 3 GHz and completes that same program X in 2.5 minutes, programmers are more interested in the MIPS solution
34
Iron Law of µP Performance
If a solution on an Intel CPU can be expressed in an object program of size Y bytes, but on an IBM architecture of size 1.1 Y bytes, the Intel solution is generally more attractive
Assuming the same execution performance
Meaning of this:
Wall-clock time (Time) is time the user has to wait for program completion
Program Size is indicator of overall physical complexity of computational task
35
Iron Law of µP Performance
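The formula that belongs on this slide did not survive transcription; the Iron Law of processor performance is commonly stated as the following product, reconstructed here in its standard textbook form rather than copied from the original slide:

```latex
\frac{\mathrm{Time}}{\mathrm{Program}} =
\frac{\mathrm{Instructions}}{\mathrm{Program}} \times
\frac{\mathrm{Cycles}}{\mathrm{Instruction}} \times
\frac{\mathrm{Time}}{\mathrm{Cycle}}
```

Lower total wall-clock time wins, regardless of which individual factor (instruction count, CPI, or clock period) achieves it.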
36
Amdahl’s Law
Articulated by Gene Amdahl during the 1967 AFIPS conference; yes, computers existed then already
Stating that the maximum speedup of a program P is dominated by its sequential portion S
I.e. if some part of P can be perfectly or infinitely accelerated by numerous parallel processors, but some part S of P is inherently sequential, then the resulting performance is dominated by S
See Wikipedia sample:
37
Amdahl’s Law (From Wikipedia)
The speedup of a program using multiple processors in parallel computing is limited by the sequential fraction of the program. For example, if 95% of the program can be parallelized, the theoretical maximum speedup using parallel computing would be 20, no matter how many processors are used
n ∈ ℕ, the number of processors
B ∈ [0, 1], the strictly sequential fraction of the program
T(n) = time to execute with n processors
T(n) = T(1) ( B + (1 - B) / n )
S(n) = speedup = T(1) / T(n)
S(n) = 1 / ( B + (1 - B) / n )
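The last formula, sketched in Python:

```python
# Sketch of Amdahl's speedup S(n) = 1 / (B + (1-B)/n), with B the
# sequential fraction and n the number of processors.
def speedup(b, n):
    return 1.0 / (b + (1.0 - b) / n)

# 95% parallelizable → B = 0.05; the limit as n grows is 1/B = 20.
# speedup(0.05, 4096) ≈ 19.9, and no processor count ever exceeds 20.
```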
38
Amdahl’s Law (From Wikipedia)
39
Registers, Dependences, Score Board in a Microprocessor
40
Register Dependencies
Inter-instruction dependencies, in engineering parlance also known as dependences, arise between registers or memory locations being defined and used
One instruction computes a result into a register (or memory); another instruction needs that result from that same register, or from that same memory location
Or, one instruction uses a resource; and after its use the same resource is newly recomputed
Dependences require sequential execution, lest the result is unpredictable
41
Register Dependencies
True-Dependence, AKA Data Dependence (synonymous!): Read after Write, RAW
1: r3 ← r1 op r2
2: r5 ← r3 op r4
Anti-Dependence, not a true dependence; parallelizable under the right conditions: Write after Read, WAR
1: r3 ← r1 op r2
2: r1 ← r5 op r4
Output Dependence, similar to Anti-Dependence: Write after Write, WAW, with a use in between
1: r3 ← r1 op r2   -- write
2: r5 ← r3 op r4   -- use
3: r3 ← r6 op r7   -- write
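The three dependence kinds can be sketched as a small classifier; the instruction encoding (dest, (src1, src2)) is an illustrative convention:

```python
# Classify the dependences between two instructions, each written as
# (dest, (src1, src2)); instruction i precedes instruction j in program order.
def classify(i, j):
    di, si = i
    dj, sj = j
    deps = []
    if di in sj:
        deps.append("RAW")   # true/data dependence: j reads what i wrote
    if dj in si:
        deps.append("WAR")   # anti-dependence: j overwrites what i read
    if di == dj:
        deps.append("WAW")   # output dependence: both write the same reg
    return deps

# classify(("r3", ("r1", "r2")), ("r5", ("r3", "r4")))  → ["RAW"]
# classify(("r3", ("r1", "r2")), ("r1", ("r5", "r4")))  → ["WAR"]
```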
42
Register Dependencies
Control Dependence:
// ri, i = 1..4 come in “live”
if ( condition1 ) {
r3 = r1 op r2;
} else {   // note the jump here!
r5 = r3 op r4;
} // end if
write( r3 );
43
Register Renaming
Only data dependence is a real dependence, hence called true dependence
Other dependences are artifacts of insufficient resources, generally insufficient registers
This means: if additional registers were available, then replacing some of these conflicting registers with new, other registers could make the conflict disappear!
Anti- and Output-Dependences are indeed such false dependences
44
Register Renaming
Original Code:
-- r2, r3, r5, r6, r7 come in “live” ... code before
-- r1, r4 are not “live”, don’t have values
L1: r1 ← r2 op r3
L2: r4 ← r1 op r5
L3: r1 ← r3 op r6
L4: r3 ← r1 op r7
Dependences before:
Lx, Ly for x, y = 1..4: which dependence holds?
45
Register Renaming
Original Code:
L1: r1 ← r2 op r3
L2: r4 ← r1 op r5
L3: r1 ← r3 op r6
L4: r3 ← r1 op r7
Initial Dependences, lots of dependences:
L1, L2 true-Dep with r1
L1, L3 output-Dep with r1
L1, L4 anti-Dep with r3
L3, L4 true-Dep with r1
L2, L3 anti-Dep with r1
L3, L4 anti-Dep with r3
46
Register Renaming
What could we change if we had some additional registers?
Compute and use other temporaries via other registers to reduce dependences!
That sometimes allows a higher degree of parallelism, due to a lower degree of dependences
More parallelism means faster execution!
47
Register Renaming
Original Code:                  New Code, after adding regs:
L1: r1 ← r2 op r3               r10 ← r2 op r30   -- r30 instead of r3
L2: r4 ← r1 op r5               r4  ← r10 op r5   -- r10 instead of r1
L3: r1 ← r3 op r6               r1  ← r30 op r6
L4: r3 ← r1 op r7               r3  ← r1 op r7
Dependences before:             Dependences after:
L1, L2 true-Dep with r1         L1, L2 true-Dep with r10
L1, L3 output-Dep with r1       L3, L4 true-Dep with r1
L1, L4 anti-Dep with r3         // ri, i = 1..7 are “live”
L3, L4 true-Dep with r1
L2, L3 anti-Dep with r1
L3, L4 anti-Dep with r3
48
Register Renaming
• With these additional, renamed regs, the new code could possibly run in half the time!
• First: Compute into r10 instead of r1; you need to have such an additional register r10; no time penalty!
• Also: Compute in preceding code into r30 instead of r3, if r30 is available; also no time penalty!
• Then the following regs are live afterwards: r1, r3, r4, plus the non-modified ones, e.g. r2! r2 came in live, must go out live!
• While r10 and r30 are don’t-cares afterwards
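The “half the time” claim can be checked with a toy earliest-cycle scheduler, assuming unit latency and unlimited ALUs; for this simple sketch the dependences must be listed in program order:

```python
# Earliest-cycle scheduling given a list of dependence edges, showing why
# renaming halves the critical path for the four-instruction example above.
def schedule(deps, n_instr):
    cycle = [1] * n_instr                # every instruction could start in cycle 1
    for earlier, later in deps:          # each dep forces `later` after `earlier`
        cycle[later] = max(cycle[later], cycle[earlier] + 1)
    return cycle

# Before renaming, the true + anti + output deps chain all four (0-indexed):
before = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3)]
# After renaming, only the true dependences L1→L2 and L3→L4 remain:
after = [(0, 1), (2, 3)]
# max(schedule(before, 4)) == 4 cycles; max(schedule(after, 4)) == 2 cycles
```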
49
Score Board
Score-board: array of single-bit HW programmable resources sb[]
Manages other HW resources, specifically registers
In this single-bit HW array sb[], every bit i in sb[i] is associated with a specific machine register, the one identified by i, i.e. ri
Association is by index, i.e. by name: sb[i] belongs to reg ri
Only if sb[i] = 0 does register i hold valid data, and can it therefore be used
If sb[i] = 0 then register ri is NOT in the process of being written
If bit i is set, i.e. if sb[i] = 1, then that register ri is reserved; i.e. it cannot be used
Initially all sb[*] are set to 0; hence all registers ri can be used
50
Score Board
Execution constraints for rd ← rs op rt:
If sb[s] or sb[t] is set → RAW dependence, hence stall the computation; wait until both rs and rt are available
If sb[d] is set → WAW dependence, hence stall the write; wait until rd has been used; SW can sometimes determine to use another register instead of rd
Else, if none of the 3 registers is in use, dispatch the instruction immediately
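The issue rules above can be sketched as a toy scoreboard; the class and method names are illustrative:

```python
# Sketch of the scoreboard issue rule for `rd ← rs op rt`:
# stall on RAW (a source is being written) or WAW (the dest is reserved).
class ScoreBoard:
    def __init__(self, n_regs):
        self.sb = [0] * n_regs          # initially all 0: every register usable

    def try_issue(self, d, s, t):
        if self.sb[s] or self.sb[t]:    # RAW hazard: a source is not valid yet
            return False
        if self.sb[d]:                  # WAW hazard: destination still reserved
            return False
        self.sb[d] = 1                  # reserve rd while it is being computed
        return True

    def retire(self, d):
        self.sb[d] = 0                  # result written: rd is valid again

sb = ScoreBoard(8)
sb.try_issue(3, 1, 2)    # r3 ← r1 op r2 issues
sb.try_issue(5, 3, 4)    # r5 ← r3 op r4 must stall (RAW on r3)
sb.retire(3)
sb.try_issue(5, 3, 4)    # now it issues
```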
51
Score Board
Out of order (ooo) execution has long been common on microprocessors, where speed is critical
To allow ooo execution, upon computing the value of rd: update rd, and clear sb[d]
For uses (reads), HW may use any register i whose sb[i] is 0
For definitions (writes), HW may set any register j whose sb[j] is 0
Independent of the original order in which the source program was written, i.e. possibly ooo
Ooo instructions retire in original order
52
Bibliography
1. http://en.wikipedia.org/wiki/Flynn's_taxonomy
2. http://www.ajwm.net/amayer/papers/B5000.html
3. http://www.robelle.com/smugbook/classic.html
4. http://www.intel.com/design/itanium/manuals.htm
5. http://www.csupomona.edu/~hnriley/www/VonN.html
6. http://cva.stanford.edu/classes/ee482s/scribed/lect11.pdf