Week 5 Lecture slides
1
COSC 3P92
Voters quickly forget what a man says. Richard M. Nixon (1913-1994), Former U.S. President
2
COSC 3P92
Hardware components MIC (overview)
• MAR and MDR are registers which latch the addresses and data prior to processing
3
COSC 3P92
Hardware components MIC (overview)
• Translate word addresses 0, 1, 2, 3… to byte addresses 0, 4, 8, 12… (4-byte words).
– Shift 2 bits left.
– Causes word 0, 1, 2, 3 … to be addressed.
– Alignment of words.
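The shift can be illustrated with a minimal sketch, assuming 4-byte words as on this slide. The function name is illustrative and does not come from the slides.

```python
WORD_SIZE_BYTES = 4  # assumption: 32-bit (4-byte) words


def word_to_byte_address(word_address: int) -> int:
    """Shift a word address left by 2 bits to get the byte address of that word."""
    return word_address << 2


# Words 0, 1, 2, 3 start at byte addresses that are multiples of the word size.
for word in range(4):
    byte_address = word_to_byte_address(word)
    assert byte_address % WORD_SIZE_BYTES == 0  # always word-aligned
    print(word, "->", byte_address)
```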
4
COSC 3P92
Hardware components MIC (overview)
• Each micro instruction controls
– register enables
– bus enables
– ALU
– Memory
– Next Micro instruction address
5
COSC 3P92
Hardware components MIC (overview)
6
COSC 3P92
Memory control
• MAR - memory address register
– CPU writes addresses of memory to read, write
• MBR - memory buffer register
– contains data for write or read
• both act as ‘latches’ to hold addr, data until memory has finished using them.
7
COSC 3P92
Control unit
• main functions of a control unit:
- instruction interpretation
- instruction sequencing
• the control unit is a finite-state machine.
[Figure: CPU block diagram. Labels: Control Unit, Execution Unit, status signals, control signals, external command signals, master clock]
8
COSC 3P92
Typical CPU model
[Figure: typical CPU model. General-purpose registers: R0, R1, …, Rn-1. Dedicated registers: SR (status reg), IR (instn reg), PC (prog cntr), SP (stack ptr), MAR (mem addr reg), MBR (mem buffer reg), etc. Also shown: ALU (arithmetic logic unit), Control Unit, dedicated multiply, division firmware (FP)]
Execution unit
• An execution unit consists of:
– a register section
– an ALU
– some dedicated hardware or firmware
9
COSC 3P92
Data transfer within a CPU
• A single-bus architecture:
• To compute R2 <– R0 + R1:
1. A <– R0,
2. B <– R1,
3. R2 <– A+B
[Figure: single-bus data path. Labels: ALU, buffer reg. A, buffer reg. B, general-purpose regs (R0, R1, etc), special-purpose regs (PC, etc)]
10
COSC 3P92
Data transfer within a CPU
• A two-bus architecture
• To compute R2 <– R0 + R1:
1. Buffer <– R0 + R1 (via Bus A and Bus B),
2. R2 <– Buffer (via either Bus A or Bus B).
[Figure: two-bus data path. Labels: ALU, buffer reg., general regs. (R0, R1, etc), PC, etc, Special I, Special II, MBR, BUS A, BUS B]
11
COSC 3P92
Data transfer within a CPU
• A three-bus architecture:
• To compute R2 <– R0 + R1:
1. R2 <– R0 + R1 (via Bus A, Bus B and Bus C).
[Figure: three-bus data path. Labels: ALU, R0, R1, etc, PC, etc, Special I, Special II, MBR, BUS A, BUS B, BUS C]
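The register-transfer sequences for the single-, two- and three-bus organizations can be compared with a small sketch. It only replays the transfer steps listed on the slides on a dictionary of registers; the function names and the dictionary model are assumptions for illustration, not a hardware model.

```python
def single_bus(regs):
    """One shared bus: operands are staged in buffer registers A and B first."""
    steps = []
    a = regs["R0"]
    steps.append("A <- R0")
    b = regs["R1"]
    steps.append("B <- R1")
    regs["R2"] = a + b
    steps.append("R2 <- A + B")
    return steps


def two_bus(regs):
    """Two buses feed the ALU at once; the result waits in a buffer register."""
    steps = []
    buffer = regs["R0"] + regs["R1"]
    steps.append("Buffer <- R0 + R1")
    regs["R2"] = buffer
    steps.append("R2 <- Buffer")
    return steps


def three_bus(regs):
    """Two source buses and one result bus: the whole transfer is one step."""
    regs["R2"] = regs["R0"] + regs["R1"]
    return ["R2 <- R0 + R1"]


for architecture in (single_bus, two_bus, three_bus):
    registers = {"R0": 2, "R1": 3, "R2": 0}
    transfer_steps = architecture(registers)
    print(architecture.__name__, len(transfer_steps), "step(s), R2 =", registers["R2"])
```

Every variant leaves the same value in R2; only the number of transfer steps differs.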
12
COSC 3P92
Design of control units
• Hardwired approach
• The control unit is treated as a synchronous (i.e., clocked) sequential circuit and is implemented as a hardwired state machine.
[Figure: Register Transfer Model of Finite State Machine. Labels: registers, combinational logic, inputs, outputs, feedback paths]
[Figure: PLA Implementation of a Finite State Machine. Labels: registers, AND plane, OR plane, inputs, outputs, next state]
13
COSC 3P92
Microprogramming
• Use of memory to implement the control unit
• Instructions are implemented as sequences of microinstructions stored in control memory
• Each machine language instruction is interpreted by circuitry, and executed using sequences of microprogram instructions
• Micro-programs are much like assembled code, except:
– direct mapping between instruction fields and hardware components of the CPU.
– control fields are specified.
– timing is critical; parallelism can be exploited.
14
COSC 3P92
Microprogramming
[Figure: registers and combinational logic driven by control values]
• What is being controlled?
– data paths: inter-register connections
– control points: hardware enabling lines which govern register-to-register communications
• idea is that we can control the operation of ALU and micro-control unit using combinations of control fields encoded in micro-instructions
15
COSC 3P92
Microprogramming
• Each control point specifies a micro-operation
– All micro operations which may be executed in parallel can be specified in a single micro instruction.
• Factors which determine parallel operations.
– Buses must only have 1 input active at a time.
– Registers can be either read/written
» Not both at the same time.
16
COSC 3P92
Microprogramming
• Basic microinstruction formats: {Over heads}
17
COSC 3P92
Data path
• 32-bit registers (none are user-accessible)
• B bus: main one to ALU
• C bus: from ALU back to registers
• H reg: contains other operand for ALU
– loaded by performing null op on data, and sending it to H
18
COSC 3P92
Data path
• ALU control: 6 control lines
• shifter: 2 control lines
– 1. logical shift left 8 bits
– 2. arithmetic shift right 1 bit
19
COSC 3P92
Data path timing
• Four sub-cycles:
– 1. control signals set up (w)
– 2. registers loaded onto B bus (x)
– 3. ALU and shifter operate (y)
– 4. results available to registers on C bus (z)
20
COSC 3P92
Data path timing
• These are implicit sub-cycles: they rely on timing of previous steps
• Only real clock signals used:
– falling edge of clock (starts the cycle)
– rising edge (loading from C in step 4)
• ALU is continually processing all intermediate values it sees. Its output only makes sense at the appropriate time above (after 3)
• Can operate on and save a register in 1 clock cycle:
– load PC to B
– inc
– save to PC
21
COSC 3P92
Memory again
• 2 memory buffers:
– 32 bit port: MAR, MDR (read, write)
» word addresses
– 8-bit port: MBR
» byte read at the address in PC (read only)
» byte addresses
» can be loaded signed, unsigned onto B bus
» reads into MBR are called “fetches”
• control:
– black arrow: enable from C bus
– white arrow: enable onto B bus
• 2 bus control:
– out B
– in C
– out B / in C
– none
22
COSC 3P92
Memory again
• MAR aligned to words (32 bits, 4 bytes): [4.4]
• Memory is available 2 cycles from when read was initiated
– avail. at end of 2nd cycle, so 3rd cycle can use it
23
COSC 3P92
Microinstructions
• 29 signals for data path:
– 1. 9 signals to control C bus output into registers
– 2. 9 signals to enable registers onto B bus
– 3. 8 signals for ALU, shifter functions (6 ALU + 2 shifter)
– 4. 2 signals for memory W/R via MAR/MDR
– 5. 1 signal for memory fetch via PC/MBR
• Issues:
– may load more than 1 reg from C (9 bits)
– but never load more than 1 reg onto B (encoding the choice in 4 bits forces this) --> 4 signals.
• Need 2 more fields for determining next m.i.:
– NextAddr (9 bits, addr space of 512)
– conditional jumps (3 bits)
24
COSC 3P92
Microinstructions
• Fields:
– Addr: address of next micro-instruction
– JAM: determines how next m.i. selected
– ALU: ALU, shifter control
– C: which registers written from C bus
– Mem: memory functions
– B: B source (encoded)
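The field list above can be sketched as a bit-packing routine. The widths used here are an assumption pieced together from the slides: 9 bits for Addr, 3 for JAM, 8 for ALU and shifter (6 ALU lines plus 2 shifter lines from the data path slides), 9 for C, 3 for Mem (write, read, fetch) and 4 for the encoded B source, giving 36 bits. Packing them in the listed order, most significant first, is also an assumption.

```python
# (field name, width in bits), most significant field first (assumed order).
FIELDS = [("addr", 9), ("jam", 3), ("alu", 8), ("c", 9), ("mem", 3), ("b", 4)]
TOTAL_BITS = sum(width for _, width in FIELDS)


def pack(**values):
    """Pack named field values into one microinstruction word."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word):
    """Split a microinstruction word back into its named fields."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


word = pack(addr=0x0A2, jam=0b001, c=0b000000100, b=5)
print(TOTAL_BITS, format(word, "036b"))
print(unpack(word))
```

Keeping the C field as 9 separate bits lets one microinstruction write several registers at once, while the 4-bit encoded B field guarantees that only one register drives the B bus.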
25
COSC 3P92
Example microarchitecture: Mic-1
26
COSC 3P92
Example microarchitecture: Mic-1
• sequencer: executes microinstructions
• Two tasks:
– set control signals for system
– determine next m.i. to execute
• control store: contains m.i. for interpreting ISA instns.
– each m.i. a 36-bit word like [4.5]
– each m.i. specifies its successor
• MPC: MicroProgram Counter
– 9-bit address of next m.i. to execute
• MIR: MicroInstruction Register
– 36-bit m.i. being executed
• Note that bits in MIR may directly control other parts of the circuit
– eg. C
27
COSC 3P92
Mic-1 operation cycle
• Basic ALU cycle:
– 1. set up the inputs to the ALU
– 2. let the ALU do its computation
– 3. store the results
• Clock cycles for Mic-1
– 1. MIR enabled (during subcycle w)
– 2. MIR signals control data path (B bus; note H always enabled) (subcycle x)
– 3. B and H inputs are stable, and ALU computes output; shifter finishes; N, Z bits stable (subcycle y)
– 4. shifter, N, Z outputs loaded from C bus into registers
» rising clock edge determines end
» next MIR contents are determined and loaded at this point as well
» Memory read is initiated at end too
• Note that all the above will complete in 1 cycle
– microinstructions can specify all these operations in parallel
28
COSC 3P92
Mic-1 sequencing
• First, 9-bit next addr field copied into MPC
• JAM inspected:
– 000 = use MPC as it is
– if JAMN (or JAMZ) set, then N bit (or Z) is ORed with high bit of MPC
» hence next address is either MPC, or MPC with high bit set to 1
– JMPC set: MBR byte ORed with low byte of NextAddr field
» permits multiway jumps
» can quickly branch to instn for just-loaded opcodes (ie. opcode number = address in control store!)
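The next-address rule can be written out as a short sketch. The function and parameter names are illustrative; the logic follows the bullets above: start from the 9-bit NextAddr field, OR the N or Z flag into the high bit when JAMN or JAMZ is set, and OR the MBR byte into the low 8 bits when JMPC is set.

```python
def next_mpc(next_addr, jam_n, jam_z, jmpc, n, z, mbr):
    """Compute the 9-bit address of the next microinstruction.

    All flag arguments are 0 or 1; next_addr is 9 bits; mbr is 8 bits.
    """
    high = (next_addr >> 8) & 1
    high |= (jam_n & n) | (jam_z & z)  # conditional jump sets the high bit
    low = next_addr & 0xFF
    if jmpc:
        low |= mbr & 0xFF  # multiway branch on the byte just fetched
    return (high << 8) | low


# Conditional jump on Z: the two targets differ only in the high bit.
print(hex(next_mpc(0x092, 0, 1, 0, n=0, z=0, mbr=0)))
print(hex(next_mpc(0x092, 0, 1, 0, n=0, z=1, mbr=0)))
# Opcode dispatch: with NextAddr = 0, the opcode byte itself is the address.
print(hex(next_mpc(0x000, 0, 0, 1, n=0, z=0, mbr=0x60)))
```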
29
COSC 3P92
Microinstructions and notation
• As in assembler programming, helps to use higher-level notation instead of raw numeric m.i. fields
• can specify everything that happens in 1 clock cycle:
– permits parallelism: eg. prefetch next instns
• Notation: high-level, but directly translatable to single m.i.’s
• Examples:
– SP=SP+1: incr SP by 1
– MDR = SP: copy SP into MDR
– MDR = SP+H; rd : add SP and H, save in MDR, and initiate a read
– SP=MDR=SP+1: incr SP, load into both MDR, SP
30
COSC 3P92
Microinstructions and notation
• Memory takes 2 cycles:
MAR=SP; rd : start reading the word at SP into MDR
(another instn)
* memory ready now!
• next addresses: assume it is the labeled next m.i. after current one (unless a conditional jump)
– if (Z) goto L1; else goto L2 : sets JAMZ
» L1 and L2 have the same low 8 bits (arranged by assembler)
• Summary of legal operations on operands:
31
COSC 3P92
Example M.I. implementation: IJVM
• A stack-based virtual machine that Mic-1 is designed to implement.
• All instructions access the stack: no general registers are used by compiler
– eg. parameter passing [4.8]
– eg. arithmetic [4.9]
• Recall:
– JVM instruction formats: [5.15]
– Java memory usage, registers: [4.10]
• Complete instruction set: [4.11]
• Example translated code: [4.14]
32
COSC 3P92
33
COSC 3P92
JVM Instruction Formats
34
COSC 3P92
Memory area of IJVM
35
COSC 3P92
IJVM Instruction Set
36
COSC 3P92
Translating Java to IJVM
37
COSC 3P92
Implementation (cont)
• See overheads (book page 234-236)
• Note:
– each m.i. contains address of next instn
– micro-assembler labels all instns appropriately, and must put them in right control store addresses (equiv. to opcode)
– the sequenced instns may reside in any free area of control store! Microassembler automatically sets the ‘next address’ fields.
– only explicit ‘goto’s will override this sequencing
• Two parts:
– 1. fetch next byte for next instn (done at Main1)
– 2. branch to that opcode address and carry out instruction
• Fetching instructions (Main1)
– PC always points to next instruction in Java application program
– can be reset by branches (see goto5, T, F,...)
– When Main1 is executed, the next opcode is assumed to be ready; the fetch issued at Main1 is for the byte that follows. Hence instns must do their own fetch if necessary (eg. see bipush2)
38
COSC 3P92
Implementation (cont)
• Example 1: iadd (“pop 2 words from stack, push their sum”)
– iadd1: reads next-to-top word in stack (TOS register already contains top of stack word); bumps down the SP for writing result
– iadd2: sets TOS ready for addition (put in H)
– iadd3: add next-to-top value (read in iadd1) to H, update TOS, save result in MDR for writing
• Example 2: dup (“copy top stack word and push it”)
– dup1: incr SP pointer, copy to MAR
– dup2: save TOS (top stack word) to new SP, write it
– note: can’t write it in dup1, because both SP and MDR must be updated thru data path, and not both at once
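The iadd and dup microsteps can be traced with a small register-level sketch. The class name, the dictionary used as word-addressed memory, and the instant memory reads are simplifying assumptions (the real Mic-1 needs a cycle of memory latency, which is why iadd2 sits between the read and its use); the register names follow the slides.

```python
MASK_32 = 0xFFFFFFFF


class Mic1Stack:
    """Toy model of the SP, TOS, MAR, MDR and H registers plus stack memory."""

    def __init__(self, words):
        self.mem = dict(enumerate(words))  # word address -> value
        self.sp = len(words) - 1           # SP points at the top word
        self.tos = words[-1]               # TOS caches the top word
        self.mar = self.mdr = self.h = 0

    def iadd(self):
        # iadd1: MAR = SP = SP - 1; rd  (read the next-to-top word)
        self.sp -= 1
        self.mar = self.sp
        self.mdr = self.mem[self.mar]
        # iadd2: H = TOS
        self.h = self.tos
        # iadd3: MDR = TOS = MDR + H; wr
        self.mdr = self.tos = (self.mdr + self.h) & MASK_32
        self.mem[self.mar] = self.mdr

    def dup(self):
        # dup1: MAR = SP = SP + 1
        self.sp += 1
        self.mar = self.sp
        # dup2: MDR = TOS; wr
        self.mdr = self.tos
        self.mem[self.mar] = self.mdr


machine = Mic1Stack([5, 7])
machine.iadd()
machine.dup()
print(machine.sp, machine.tos, machine.mem)
```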
39
COSC 3P92
Implementation (cont)
• Example 3: goto offset (“unconditional branch”)
– [Fig 4.22]
– goto1: save addr of opcode to OPC (old PC)
– goto2: get the 2nd byte of offset (1st byte already in MBR)
– goto3: shift 1st byte left 8 bits
– goto4: OR low byte into high byte
– goto5: add 16-bit offset to (old) PC; get next opcode
– goto6: goto Main1
– Note: pause needed in goto6 (must wait 2 extra cycles)
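The offset arithmetic in goto3 to goto5 can be sketched as follows. This assumes the first offset byte is treated as signed and the second as unsigned, which is what makes backward branches work; the function names are illustrative and the fetch timing is left out.

```python
def sign_extend_byte(byte):
    """Interpret an 8-bit value as a signed number in the range -128..127."""
    return byte - 0x100 if byte & 0x80 else byte


def goto_target(opc, offset_byte1, offset_byte2):
    """Return the new PC for a goto whose opcode sits at address opc."""
    h = sign_extend_byte(offset_byte1) << 8  # goto3: shift 1st byte left 8 bits
    h = h | offset_byte2                     # goto4: OR in the low byte
    return opc + h                           # goto5: add offset to the old PC


print(goto_target(100, 0x00, 0x10))  # forward branch
print(goto_target(100, 0xFF, 0xFB))  # backward branch
```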
40
COSC 3P92
41
COSC 3P92
Improving performance
• 1. Faster clock, transistors, electrical circuits
• 2. simpler organization yields shorter clock cycles
– eg. get rid of (B bus) decoder
• 3. Merge interpreter loop with microcode (pt 2)
– [4.23], [4.24]
– saves extra cycles if done in all instns
– significant speedup!
• 4. Three buses
– [4.25], [4.26]
– reduces need for separate instns to load H reg
42
COSC 3P92
43
COSC 3P92
2 Bus vs. 3 Bus
44
COSC 3P92
Improving performance
• 5. Instruction fetch unit [4.27]
– in Mic-1, ALU is used to increment PC and fetch instns
– this uses up instn. cycles
– IFU can be used:
» 1. pre-fetches all instns outside of main data path
» 2. pre-fetches operands: if they are required, they are there (else garbage, but ignored anyway)
45
COSC 3P92
Fetch Unit
46
COSC 3P92
Improving performance
• Instruction fetch unit (cont)
– shift register: always loaded with next bytes from memory
– MBR1 (1 byte, as before); and new MBR2 (2 bytes)
– values from shift reg dumped into both MBR1, MBR2 after every instn read; if needed, they are quickly put onto data path as req’d
– need some fetching logic to know when to read more bytes into shift register, when to refresh MBR1, MBR2
– IMAR: separate memory addr reg (separate from MAR)
» own dedicated incrementer (no need for ALU)
– IFU must keep PC incremented properly, depending on instn length (if MBR1, MBR2 used)
» branches may reset PC as well (from C)
47
COSC 3P92
Improving performance
• Mic-2:
– A, B buses
– IFU
– new IJVM [4.30, See overheads]
» smaller, faster
» MBR1 always has next opcode (due to IFU)
48
COSC 3P92
Mic-2
49
COSC 3P92
Improving performance: 6. Pipelining
• divide instn. execution into modular steps and carry out different steps for seql. instns simultaneously
• “instruction-level parallelism”
• superscalar: single pipeline with parallel functional units
• most instns take more than 1 cycle to complete
• with pipelining: n instns in about n cycles (ideally, once the pipeline is full)
• To implement it: [4.31]
– add latch to A, B, C buses
– they keep values stable during sub-cycles: can use values in 3 sections of the data path
» (i) loading before ALU (A, B)
» (ii) doing ALU, shift, and loading C latch
» (iii) storing C back into registers
50
COSC 3P92
Mic-3
51
COSC 3P92
Improving performance: 6. Pipelining
• need 3 cycles now to complete 1 instn
– but maximum delay between all components is shorter (1/3) so can speed up clock
– advantage: throughput -- 3 instns can be processed simult.
– all parts of data path are busy... none are idle (usually)
• best analogy: car factory assembly line
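The throughput argument can be checked with a little arithmetic. This is an idealized sketch that assumes no stalls and equal-length stages; the function names are illustrative.

```python
def cycles_unpipelined(instructions, stages):
    """Each instruction passes through every stage before the next one starts."""
    return instructions * stages


def cycles_pipelined(instructions, stages):
    """The first instruction fills the pipeline; each later one adds one cycle."""
    if instructions == 0:
        return 0
    return stages + (instructions - 1)


STAGES = 3  # the three data-path sections (i), (ii), (iii) from the slides
for count in (1, 10, 1000):
    print(count, cycles_unpipelined(count, STAGES), cycles_pipelined(count, STAGES))
```

A single instruction still needs 3 cycles, but for long instruction streams the cycle count approaches one per instruction, and each of those cycles can be shorter.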
52
COSC 3P92
Pipelining (cont)
• [4.32, 4.33, 4.44]
• interpreting instns in pipelined processor (Mic-4):
– new sub-cycles: microsteps
– takes 3 cycles to process instn (steps i, ii, iii from earlier)
– call latches A, B, C (like registers)
– advantage [4.33] is that different stages can work independently of one another now
• more stages in pipeline can mean higher throughput (each stage does less work per cycle)
53
COSC 3P92
54
COSC 3P92
55
COSC 3P92
Pipelining (cont)
• One complication: memory reads
– takes 2 cycles to get word from memory
– hence a m.i. that uses a word in MDR must wait until it’s available
– called a true or RAW (read after write) dependence
– pipeline must stall until it is ready
– ideally, fill the wait with other m.i.’s that do not need the word
• Another complication: conditional branches
– cannot predict which instn to fetch/put into pipeline
– have to “squash” or “flush” pipeline when a jump ruins sequence of instns
56
COSC 3P92
Pipelines and branch prediction
• unconditional branches
– fetch unit needs to know in advance where to access instns
– a jump instn. isn’t decoded right away, and so F.U. won’t know branch location until later: called the delay slot
– soln: compiler places other executable instns in delay slot, that it knows can be executed
• conditional branches
– dynamic prediction: carried out during run time
– keep a running table of branch instn addresses, along with a “branch/no branch” bit
– if branch in table, and branch bit set, then predict it will be taken --> fetch its target
– can use 2 prediction bits: prediction changes only after being wrong twice in a row (extra logic)
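A dynamic predictor with 2 prediction bits can be sketched as a table of saturating counters. This is one common scheme and an assumption here, not necessarily the exact state machine in the course text; the class name, the initial counter value and the example branch address are all illustrative.

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counter: 0-1 predict not taken, 2-3 taken."""

    INITIAL = 1  # assumption: unseen branches start as weakly not taken

    def __init__(self):
        self.table = {}  # branch instruction address -> counter value 0..3

    def predict(self, address):
        return self.table.get(address, self.INITIAL) >= 2

    def update(self, address, taken):
        counter = self.table.get(address, self.INITIAL)
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
        self.table[address] = counter


def count_mispredictions(outcomes, address=0x40):
    """Run one branch through the predictor and count wrong guesses."""
    predictor = TwoBitPredictor()
    wrong = 0
    for taken in outcomes:
        if predictor.predict(address) != taken:
            wrong += 1
        predictor.update(address, taken)
    return wrong


# A loop branch taken 9 times and then not taken, executed twice.
loop = [True] * 9 + [False]
print(count_mispredictions(loop * 2), "mispredictions out of", len(loop * 2))
```

The single not-taken outcome at the loop exit does not flip a strongly held prediction, so the second run of the loop starts out predicted correctly.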
57
COSC 3P92
Pipelines and branch prediction
• static branch prediction: carried out during compile time
– if a loop branch is nearly always taken, then have a field in the instn. which tells CPU to predict the branch as taken (eg. UltraSPARC)
– can do simulations to determine how cond. branches executed
58
COSC 3P92
Improving performance: out-of-order exec, reg renaming
• instruction ops can take varying # clock cycles
– in superscalar systems, some functional units need more time than others to process their instns
• problem: can’t exec one instn that requires results of another
– means the pipeline stalls until register values are computed when subsequent instns require them.
• soln: move instruction order, so that no idle waiting
– overall exec must be identical to “linear” order
• dependencies:
– RAW (read after write): try to read reg before another instn has written it.
– WAR (write after read): try to write before another has read it
– WAW (write after write): try to write a reg before an earlier instn has written it (writes land in wrong order)
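The three dependence types can be detected mechanically from each instruction's destination and source registers. The sketch below assumes a simple three-register instruction form; the `Instr` type and the function name are illustrative.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    dest: str                 # register written
    sources: Tuple[str, ...]  # registers read


def dependences(earlier, later):
    """Return the dependence types that constrain reordering the two instructions."""
    found = set()
    if earlier.dest in later.sources:
        found.add("RAW")  # later reads what earlier writes
    if later.dest in earlier.sources:
        found.add("WAR")  # later overwrites what earlier still has to read
    if later.dest == earlier.dest:
        found.add("WAW")  # both write the same register
    return found


first = Instr("R3", ("R0", "R1"))   # R3 = R0 + R1
second = Instr("R4", ("R3", "R2"))  # R4 = R3 + R2
third = Instr("R0", ("R5", "R6"))   # R0 = R5 + R6
print(dependences(first, second))
print(dependences(first, third))
```

Only an empty result means the pair can be freely reordered; WAR and WAW conflicts are the ones register renaming can remove.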
59
COSC 3P92
In-order exec, in-order completion
– decode in cyc n, exec n+1, writeback n+2 (except multiply in n+3)
– 2 instns decoded simult.
– uses scoreboard: 1 counter per reg keeping track of # instns using it as a source or destination
– keeps track of max # regs that can be processed concurrently
60
COSC 3P92
Out-of-order exec, reg renaming (cont)
– idea: execute instns so long as resources are available, and no conflicts
– move order of instns to permit this
– registers are renamed automatically to reduce conflicts: “secret regs”
» eg. if a register is in conflict, rename it so conflict is removed.
» copy values to original named reg later if required.
– result: huge performance gain (we’re trying to make pipeline maximally useful!)
61
COSC 3P92
Improving performance: speculative exec
• block: a section of sequential code [4.45]
• Can increase throughput by moving instructions beyond their blocks
– hoisting: moving an instruction over a branch
• speculative execution: executing an instruction before it is known whether it will be needed
– OK to do it so long as there is no side effect (eg. write to memory, trap/interrupt)
– may sometimes cause slowdown if spec. exec fetches an instn from memory that isn’t needed
– otherwise, idea is to move slower instructions up the queue so that their processing can occur in the interim
• some solns:
– speculative instns: only fetch/exec instructions that are in the cache
– poison bits: don’t set traps automatically; wait until that instn actually executed, and if a poison bit is set, then set the trap
62
COSC 3P92
Speculative exec
63
COSC 3P92
Example 1: Pentium II
• 1. Fetch/decode [4.46]
– fetches instns and breaks them into m.i.’s
• 2. dispatch/exec
– takes m.i.’s and execs them
• 3. retirement unit
– completes exec, stores reg values (speculative exec)
• 1, 2, 3 above act as high-level pipeline
• ROB (reorder buffer): table of m.i.’s to execute
• Fetch/decode [4.47]
– 7-stage pipeline
– multiple formats, sizes means instn decoding is involved
– analyzes instns to determine: size, branch-prediction
– usually between 1 and 4 m.i.’s per ISA instn.
– uses reg renaming
– both static, dynamic branch prediction used
• Dispatch/exec [4.48]
– 5 m.i.’s can be exec’d at once
64
COSC 3P92
P2-micro architecture
65
COSC 3P92
66
COSC 3P92
Example 2: UltraSPARC II
• [4.49]
• RISC: all instns are 3-register microinstns already
• branch prediction: (i) cache flags; (ii) 2-bit prediction; (iii) compiler directions in instns
• tries to exec 4 instns in parallel all the time
– instns may be executed out of order
• 9-stage pipeline [4.50]
– split integer, float pipelines
– int adds 2 stages (N1, N2) to keep it same as fp
67
COSC 3P92
UltraSPARC
68
COSC 3P92
UltraSPARC Pipeline
69
COSC 3P92
Example 3: picoJava II
• [4.51]
• instn, data caches are optional
• register file (64 entries)
– contains top 64 words of stack
– dribbling: reg file read/written to memory when it gets too empty/full
– “free” access, w/o accessing caches (which may not be used)
70
COSC 3P92
71
COSC 3P92
• 6-stage pipeline [4.52]
– CISC instns
– not superscalar: instns fetched, retired in order (unlike Pentium II)
• no branch prediction alg (economy)
72
COSC 3P92
Folding
• Folding [4.53, 4.54, 4.55]
– replace a set of m.i.’s with one m.i.
– looks up patterns in a table [4.55], and replaces with equivalent m.i.
– only possible if operands are high in stack, in register file
– huge gain in speed, like RISC performance
73
COSC 3P92
74
COSC 3P92
75
COSC 3P92
Comparing these examples
• common features
– all m.i.’s contain opcode, 2 source regs, dest reg
– 1 m.i. per cycle
– deep pipelines
– split instn and data caches
• Pentium II: complexity is in deconstructing its CISC instns into micro-operations
• JVM: complexity is in folding sets of m.i.’s into single operations
• UltraSPARC is the most straightforward to implement, because instns require minimal decoding (all RISC instructions are micro-operations already!)
76
COSC 3P92
The end