Week 5 Lecture slides

COSC 3P92

Voters quickly forget what a man says.

Richard M. Nixon (1913-1994), former U.S. President


Hardware components MIC (overview)

• MAR and MDR are registers which latch the addresses and data prior to processing


Hardware components MIC (overview)

• MAR holds word addresses 0, 1, 2, 3…; these are translated to the byte addresses of 4-byte words.

– Shift the word address 2 bits left.

– Causes words 0, 1, 2, 3… to be addressed (at bytes 0, 4, 8, 12…).

– Keeps words aligned.
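A minimal sketch of the shift above, assuming MAR holds a word address and memory itself is byte-addressed. The function name is hypothetical, chosen only for illustration.

```python
WORD_SIZE_BYTES = 4  # 32-bit words, as on the slide


def word_to_byte_address(word_address: int) -> int:
    """Shift a word address left by 2 bits to get the byte address of that word."""
    return word_address << 2


if __name__ == "__main__":
    for word in range(4):
        # Each result is a multiple of 4, so every access is word-aligned.
        print(f"word {word} -> byte {word_to_byte_address(word)}")
```

Because the two low bits of the byte address are always zero, a word access can never straddle two words.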



• Each micro instruction controls:

– register enables

– bus enables

– ALU

– Memory

– Next Micro instruction address


Memory control

• MAR - memory address register

– CPU writes addresses of memory to read, write

• MBR - memory buffer register

– contains data for write or read

• both act as ‘latches’ to hold addr, data until memory has finished using them.


Control unit

• main functions of a control unit:

- instruction interpretation

- instruction sequencing

• the control unit is a finite-state machine.

[Figure: CPU made of a Control Unit and an Execution Unit, with status signals and control signals between them; external command signals and master clock as inputs]


Execution unit

[Figure: Typical CPU model: general purpose registers (R0, R1, …, Rn-1); dedicated registers (SR status reg, IR instn reg, PC prog cntr, SP stack ptr, MAR mem addr reg, MBR mem buffer reg, etc...); ALU (arithmetic logic unit); Control Unit; dedicated multiply, division firmware (FP)]

• An execution unit consists of:

– a register section

– an ALU

– some dedicated hardware or firmware


Data transfer within a CPU

• A single-bus architecture:

• To compute R2 <– R0 + R1:

1. A <– R0,

2. B <– R1,

3. R2 <– A+B

[Figure: single bus connecting general purpose regs (R0, R1, etc), special purpose regs (PC, etc), and the ALU with buffer regs A and B]


Data transfer within a CPU

• A two-bus architecture:

• To compute R2 <– R0 + R1:

1. Buffer <– R0 + R1 (via Bus A and Bus B),

2. R2 <– Buffer (via either Bus A or Bus B).

[Figure: two-bus data path: BUS A and BUS B connect the general regs. (R0, R1, etc), Special I and Special II regs (PC, etc, MBR), and the ALU with its output buffer reg.]


Data transfer within a CPU

• A three-bus architecture:

• To compute R2 <– R0 + R1:

1. R2 <– R0 + R1 (via Bus A, Bus B and Bus C).

[Figure: three-bus data path: BUS A, BUS B and BUS C connect the regs (R0, R1, etc), Special I and Special II regs (PC, etc, MBR), and the ALU]
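The three bus organizations can be compared by counting register-transfer steps for the same operation. The sketch below just tabulates the steps listed on the slides. It assumes, as a simplification, that each listed step costs one transfer slot; the names BUS_STEPS and transfer_count are hypothetical.

```python
# Register-transfer steps for R2 <- R0 + R1, copied from the three slides.
BUS_STEPS = {
    "single-bus": ["A <- R0", "B <- R1", "R2 <- A + B"],
    "two-bus": ["Buffer <- R0 + R1", "R2 <- Buffer"],
    "three-bus": ["R2 <- R0 + R1"],
}


def transfer_count(architecture: str) -> int:
    """Number of sequential steps needed on the given bus organization."""
    return len(BUS_STEPS[architecture])


if __name__ == "__main__":
    for name, steps in BUS_STEPS.items():
        print(f"{name}: {transfer_count(name)} step(s): {'; '.join(steps)}")
```

More buses remove the need for the intermediate buffer registers, at the cost of extra wiring.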


Design of control units

• Hardwired approach

• The control unit is treated as a synchronous (i.e., clocked) sequential circuit and is implemented as a hardwired state machine.

[Figure: Register Transfer Model of Finite State Machine: inputs and outputs pass through registers around combinational logic, with feedback paths]

[Figure: PLA Implementation of a Finite State Machine: registers on inputs and outputs; AND plane feeding OR plane; next state fed back]
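To make the register-transfer model concrete, here is a small software sketch of a clocked state machine: a pure function stands in for the combinational logic, and one variable stands in for the state register. The three states and the two output signals are hypothetical, chosen only for illustration; a real hardwired control unit has many more of both.

```python
from enum import Enum


class State(Enum):
    FETCH = 0
    DECODE = 1
    EXECUTE = 2


def combinational_logic(state, halt):
    """Return (next_state, outputs) from the current state and the inputs."""
    if state is State.FETCH:
        return State.DECODE, {"mem_read": 1, "alu_enable": 0}
    if state is State.DECODE:
        return State.EXECUTE, {"mem_read": 0, "alu_enable": 0}
    # EXECUTE: stay here while halt is asserted, otherwise loop back to FETCH.
    next_state = State.EXECUTE if halt else State.FETCH
    return next_state, {"mem_read": 0, "alu_enable": 1}


def run(cycles, halt_at=None):
    """Clock the machine for a number of cycles and return the state trace."""
    state = State.FETCH
    trace = []
    for cycle in range(cycles):
        halt = halt_at is not None and cycle >= halt_at
        next_state, outputs = combinational_logic(state, halt)
        trace.append((state.name, outputs))
        state = next_state  # the state register latches on the clock edge
    return trace


if __name__ == "__main__":
    for name, outputs in run(4):
        print(name, outputs)
```

The feedback path in the figure corresponds to `state = next_state`: the output of the logic becomes its own input on the next clock.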


Microprogramming

• Use of memory to implement the control unit

• Instructions are implemented as sequences of microinstructions stored in control memory

• Each machine language instruction is interpreted by circuitry, and executed using sequences of microprogram instructions

• Micro-programs are much like assembled code, except:

– direct mapping between instruction fields and hardware components of the CPU.

– control fields are specified.

– timing is critical; parallelism can be exploited.


Microprogramming

[Figure: registers around combinational logic, producing control values]

• What is being controlled?

– data paths: inter-register connections

– control points: hardware enabling lines which govern register-to-register communications

• idea is that we can control the operation of ALU and micro-control unit using combinations of control fields encoded in micro-instructions


Microprogramming

• Each control point specifies a micro-operation

– All micro operations which may be executed in parallel can be specified in a single micro instruction.

• Factors which determine parallel operations:

– Buses must only have 1 input active at a time.

– Registers can be either read/written

» Not both at the same time.


Microprogramming

• Basic microinstruction formats: {Over heads}


Data path

• 32-bit registers (none are user-accessible)

• B bus: main one to ALU

• C bus: from ALU back to registers

• H reg: contains other operand for ALU

– loaded by performing null op on data, and sending it to H


Data path

• ALU control: 6 control lines

• shifter: 2 control lines

– 1. logical shift left 8 bits (SLL8)

– 2. arithmetic shift right 1 bit (SRA1)


Data path timing

• Four sub-cycles:

– 1. control signals set up (w)

– 2. registers loaded onto B bus (x)

– 3. ALU and shifter (y)

– 4. results available to registers on C (z)

Data path timing

• These are implicit sub-cycles: they rely on timing of previous steps

• Only real clock signals used:

– falling edge of clock (starts the cycle)

– rising edge (loading from C in step 4)

• ALU is continually processing all intermediate values it sees. Its output only makes sense at the appropriate time above (after step 3)

• Can operate on and save a register in 1 clock cycle:

– load PC onto B

– inc

– save to PC


Memory again

• 2 memory ports:

– 32-bit port: MAR, MDR (read, write)

» word addresses

– 8-bit port: PC, MBR (read only)

» low byte of MBR, read from the address in PC

» byte addresses

» can be loaded signed, unsigned onto B bus

» reads into MBR are called “fetches”

• control:

– black arrow: enable from C bus

– white arrow: enable onto B bus

• 2 bus controls:

– out B

– in C

– out B / in C

– none


Memory again

• MAR aligned to words (32 bits, 4 bytes): [4.4]

• Memory data is available 2 cycles from when read was initiated

– avail. at end of 2nd cycle, so 3rd cycle can use it


Microinstructions

• 29 signals for data path:

– 1. 9 signals to control C bus output into registers

– 2. 9 signals to enable registers onto B bus

– 3. 8 signals for ALU, shifter functions (6 ALU + 2 shifter)

– 4. 2 signals for memory W/R via MAR/MDR

– 5. 1 signal for memory fetch via PC/MBR

• Issues:

– may load more than 1 reg from C (9 bits)

– but never load more than 1 reg onto B (a 4-bit encoded field forces this) --> 4 signals instead of 9.

• Need 2 more fields for determining next m.i.:

– NextAddr (9 bits, addr space of 512)

– conditional jumps (3 bits)


Microinstructions

• Fields:

– Addr: address of next micro-instruction

– JAM: determines how next m.i. selected

– ALU: ALU, shifter control

– C: which registers written from C bus

– Mem: memory functions

– B: B source (encoded)
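The field list above can be sketched as a bit-packing routine. The widths below are assumptions drawn from the signal counts in these slides: Addr 9, JAM 3, ALU 8 (6 ALU lines plus 2 shifter lines), C 9, Mem 3 (read, write, fetch) and B 4, which sum to the 36-bit Mic-1 microinstruction word. The field order follows the list above; pack and unpack are hypothetical helper names.

```python
# (name, width in bits), most significant field first.
FIELDS = [("addr", 9), ("jam", 3), ("alu", 8), ("c", 9), ("mem", 3), ("b", 4)]
WORD_BITS = sum(width for _, width in FIELDS)  # 36


def pack(values):
    """Pack a dict of field values into one microinstruction word."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word):
    """Split a microinstruction word back into its fields."""
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values


if __name__ == "__main__":
    mi = {"addr": 0x092, "jam": 0b001, "alu": 0b00110101,
          "c": 0b000000100, "mem": 0b010, "b": 4}
    word = pack(mi)
    print(f"{word:0{WORD_BITS}b}")
    print(unpack(word))
```

Only the B field is encoded; the C field keeps one bit per register because several registers may be written from the C bus at once.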


Example microarchitecture: Mic-1


Example microarchitecture: Mic-1

• sequencer: executes microinstructions

• Two tasks:

– set control signals for system

– determine next m.i. to execute

• control store: contains m.i. for interpreting ISA instns.

– each instn a 36-bit word like [4.5]

– each m.i. specifies its successor

• MPC: MicroProgram Counter

– 9-bit address of next m.i. to execute

• MIR: MicroInstruction Register

– 36-bit m.i. being executed

• Note that bits in MIR may directly control other parts of the circuit

– eg. C


Mic-1 operation cycle

• Basic ALU cycle:

– 1. set up the inputs to the ALU

– 2. let the ALU do its computation

– 3. store the results

• Clock cycles for Mic-1

– 1. MIR enabled (during subcycle w)

– 2. MIR signals control data path (B bus; note H always enabled) (subcycle x)

– 3. B and H inputs are stable, and ALU computes output; shifter finishes; N, Z bits stable (subcycle y)

– 4. shifter, N, Z outputs loaded from C bus into registers (subcycle z)

» rising clock edge determines end

» MIR is reloaded and calculated at this point as well

» Memory read is initiated at end too

• Note that all the above will complete in 1 cycle

– microinstructions can specify all these operations in parallel


Mic-1 sequencing

• First, 9-bit next addr field copied into MPC

• JAM inspected:

– 000 = use MPC as it is

– if JAMN (or JAMZ) set, then N bit (or Z) is ORed with high bit of MPC

» hence next address is either: MPC, or MPC with high bit ORed with 1

– JMPC set: MBR byte ORed with low byte of NextAddr field

» permits multiway jumps

» can quickly branch to instn for just-loaded opcodes (ie. opcode number = address in control store!)
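The sequencing rules above can be sketched as one function. The three JAM bits are passed as separate flags so the sketch does not depend on their order inside the JAM field; next_mpc and its parameter names are hypothetical.

```python
def next_mpc(next_addr, jamn, jamz, jmpc, n, z, mbr):
    """Compute the 9-bit address of the next microinstruction.

    next_addr: 9-bit NextAddr field of the current microinstruction
    jamn, jamz, jmpc: the three JAM bits
    n, z: ALU status bits (negative, zero)
    mbr: 8-bit contents of MBR
    """
    high = (next_addr >> 8) & 1
    # JAMN / JAMZ: OR the N (or Z) bit into the high bit of the address.
    if (jamn and n) or (jamz and z):
        high = 1
    low = next_addr & 0xFF
    # JMPC: OR the MBR byte into the low 8 bits (multiway jump on opcode).
    if jmpc:
        low |= mbr & 0xFF
    return (high << 8) | low


if __name__ == "__main__":
    # Conditional jump on Z: the two targets differ only in the high bit.
    print(hex(next_mpc(0x092, False, True, False, n=0, z=0, mbr=0)))
    print(hex(next_mpc(0x092, False, True, False, n=0, z=1, mbr=0)))
    # Opcode dispatch: NextAddr = 0, so the opcode in MBR is the address.
    print(hex(next_mpc(0x000, False, False, True, n=0, z=0, mbr=0x60)))
```

This is why the two targets of a conditional jump must share their low 8 bits: only the high bit of the address can be changed by N or Z.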


Microinstructions and notation

• As in assembler programming, helps to use higher-level notation instead of raw numeric m.i. fields

• can specify everything that happens in 1 clock cycle:

– permits parallelism: eg. prefetch next instns

• Notation: high-level, but directly translatable to single m.i.’s

• Examples:

– SP=SP+1: incr SP by 1

– MDR = SP: copy SP into MDR

– MDR = SP+H; rd : add SP and H, save in MDR, and initiate a read

– SP=MDR=SP+1: incr SP, load into both MDR, SP


Microinstructions and notation

• Memory takes 2 cycles:

MAR=SP; rd : start reading the word at SP into MDR

(another instn)

* memory ready now! (MDR holds the word)

• next addresses: assumed to be the labeled next m.i. after the current one (unless a conditional jump)

– if (Z) goto L1; else goto L2 : sets JAMZ

» L1 and L2 have the same low 8 bits (set by assembler)

• Summary of legal operations on operands:


Example M.I. implementation: IJVM

• A stack-based virtual machine that Mic-1 is designed to implement.

• All instructions access the stack: no general registers are used by compiler

– eg. parameter passing [4.8]

– eg. arithmetic [4.9]

• Recall:

– JVM instruction formats: [5.15]

– Java memory usage, registers: [4.10]

• Complete instruction set: [4.11]

• Example translated code: [4.14]


JVM Instruction Formats


Memory area of IJVM


IJVM Instruction Set


Translating Java to IJVM


Implementation (cont)

• See overheads (book pages 234-236)

• Note:

– each m.i. contains address of next instn

– micro-assembler labels all instns appropriately, and must put them in right control store addresses (equiv. to opcode)

– the sequenced instns may reside in any free area of control store! Microassembler automatically sets ‘next address’ fields.

– only explicit ‘goto’s will override this sequencing

• Two parts:

– 1. fetch next byte for next instn (done at Main1)

– 2. branch to that opcode address and carry out instruction

• Fetching instructions (Main1)

– PC always points to next instruction in Java application program

– can be reset by branches (see goto5, T, F,...)

– When Main1 is executed, the next opcode is assumed to be ready. The fetch at Main1 is for the next opcode. Hence instns must fetch it if necessary (eg. see bipush2)


Implementation (cont)

• Example 1: iadd (“pop 2 words from stack, push their sum”)

– iadd1: reads next-to-top word in stack (TOS register already contains top of stack word); bumps down the SP for writing result

– iadd2: sets TOS ready for addition (put in H)

– iadd3: add next-to-top value (read in iadd1) to H, update TOS, save result in MDR for writing

• Example 2: dup (“copy top stack word and push it”)

– dup1: incr SP pointer, copy to MAR

– dup2: save TOS (top stack word) to new SP, write it

– note: can’t write it in dup1, because both SP and MDR must be updated thru data path, and not both at once
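The iadd and dup steps above can be traced in a small register-level sketch. The micro-operations in the comments are assumed to follow the usual Mic-1 listing (for example iadd1 as MAR = SP = SP - 1; rd). Two simplifications are made: a read completes immediately instead of after the 2-cycle memory delay, and values are not wrapped to 32 bits. The class Mic1State is hypothetical.

```python
class Mic1State:
    """Just enough Mic-1 state to trace stack microcode (a simplification)."""

    def __init__(self, stack):
        # stack[0] is the bottom of the stack; memory is word-addressed.
        self.mem = dict(enumerate(stack))
        self.sp = len(stack) - 1
        self.tos = stack[-1]  # TOS register caches the top-of-stack word
        self.mar = 0
        self.mdr = 0
        self.h = 0

    def rd(self):
        self.mdr = self.mem[self.mar]

    def wr(self):
        self.mem[self.mar] = self.mdr

    def stack(self):
        return [self.mem[i] for i in range(self.sp + 1)]


def iadd(s):
    # iadd1: MAR = SP = SP - 1; rd   (read next-to-top word)
    s.sp -= 1
    s.mar = s.sp
    s.rd()
    # iadd2: H = TOS
    s.h = s.tos
    # iadd3: MDR = TOS = MDR + H; wr
    s.mdr = s.tos = s.mdr + s.h
    s.wr()


def dup(s):
    # dup1: MAR = SP = SP + 1
    s.sp += 1
    s.mar = s.sp
    # dup2: MDR = TOS; wr
    s.mdr = s.tos
    s.wr()


if __name__ == "__main__":
    state = Mic1State([5, 3, 4])
    iadd(state)
    print(state.stack())
    dup(state)
    print(state.stack())
```

Keeping the top word in TOS is what lets iadd get by with a single memory read.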


Implementation (cont)

• Example 3: goto offset (“unconditional branch”)

– [Fig 4.22]

– goto1: save addr of opcode to OPC (old PC)

– goto2: get the 2nd byte of offset (1st byte already in MBR)

– goto3: shift 1st byte left 8 bits

– goto4: OR low byte into high byte

– goto5: add 16-bit offset to (old) PC; get next opcode

– goto6: goto Main1

– Note: pause needed in goto6 (must wait for the fetch started in goto5 to finish)
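The offset arithmetic in goto3 to goto5 can be sketched as follows. The sketch assumes the first offset byte is taken as signed and the second as unsigned, matching the signed and unsigned ways MBR can be loaded onto the B bus; goto_target is a hypothetical name.

```python
def sign_extend_8(byte):
    """Interpret an 8-bit value as a signed two's-complement number."""
    return byte - 0x100 if byte & 0x80 else byte


def goto_target(opcode_addr, offset_hi, offset_lo):
    """Return the new PC for 'goto offset'.

    opcode_addr: address of the goto opcode (saved in OPC by goto1)
    offset_hi, offset_lo: the two offset bytes that follow the opcode
    """
    h = sign_extend_8(offset_hi) << 8  # goto3: shift 1st byte left 8 bits
    h |= offset_lo & 0xFF              # goto4: OR low byte into high byte
    return opcode_addr + h             # goto5: add 16-bit offset to old PC


if __name__ == "__main__":
    print(goto_target(100, 0x00, 0x05))  # forward branch
    print(goto_target(100, 0xFF, 0xFC))  # backward branch
```

The offset is relative to the address of the goto opcode itself, which is why goto1 saves that address before PC moves on.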


Improving performance

• 1. Faster clock, transistors, electrical circuits

• 2. simpler organization yields shorter clock cycles

– eg. get rid of (B bus) decoder

• 3. Merge interpreter loop with microcode (pt 2)

– [4.23], [4.24]

– saves extra cycles if done in all instns

– significant speedup!

• 4. Three buses

– [4.25], [4.26]

– reduces need for separate instns to load H reg


2 Bus vs. 3 Bus


Improving performance

• 5. Instruction fetch unit [4.27]

– in Mic-1, ALU is used to increment PC and fetch instns

– this uses up instn. cycles

– IFU can be used:

» 1. pre-fetches all instns outside of main data path

» 2. pre-fetches operands: if they are required, they are there (else garbage, but ignored anyway)


Fetch Unit


Improving performance

• Instruction fetch unit (cont)

– shift register: always loaded with next bytes from memory

– MBR1 (1 byte, as before); and new MBR2 (2 bytes)

– values from shift reg dumped into both MBR1, MBR2 after every instn read; if needed, they are quickly put onto data path as req’d

– need some fetching logic to know when to read more bytes into shift register, when to refresh MBR1, MBR2

– IMAR: separate memory addr reg (separate from MAR)

» own dedicated incrementer (no need for ALU)

– IFU must keep PC incremented properly, depending on instn length (if MBR1, MBR2 used)

» branches may reset PC as well (from C)


Improving performance

• Mic-2:

– A, B buses

– IFU

– new IJVM [4.30, See overheads]

» smaller, faster

» MBR1 always has next opcode (due to IFU)


Mic-2


Improving performance: 6. Pipelining

• divide instn. execution into modular steps and carry out different steps for seql. instns simultaneously

• “instruction-level parallelism”

• superscalar: single pipeline with parallel functional units

• most instns take more than 1 cycle to complete

• with pipelining: about n instns in n cycles (once the pipeline is full)

• To implement it: [4.31]

– add latch to A, B, C buses

– they keep values stable during sub-cycles: can use values in 3 sections of the data path

» (i) loading before ALU (A, B)

» (ii) doing ALU, shift, and loading C latch

» (iii) storing C back into registers


Mic-3


Improving performance: 6. Pipelining

• need 3 cycles now to complete 1 instn

– but maximum delay between all components is shorter (1/3) so can speed up clock

– advantage: throughput -- 3 instns can be processed simult.

– all parts of data path are busy... none are idle (usually)

• best analogy: car factory assembly line
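The throughput gain can be checked with a short calculation. The sketch assumes an ideal pipeline (no stalls or flushes) in which every instruction passes through the same number of stages, one short cycle per stage; the function names are hypothetical.

```python
def cycles_unpipelined(n_instructions, stages):
    """Each instruction finishes all its stages before the next one starts."""
    return n_instructions * stages


def cycles_pipelined(n_instructions, stages):
    """The first instruction fills the pipe; then one completes per cycle."""
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1)


if __name__ == "__main__":
    n, k = 100, 3
    plain = cycles_unpipelined(n, k)
    piped = cycles_pipelined(n, k)
    print(f"unpipelined: {plain} short cycles")
    print(f"pipelined:   {piped} short cycles")
    print(f"speedup:     {plain / piped:.2f}x")
```

For long instruction sequences the speedup approaches the number of stages, which is the sense in which n instructions take about n cycles.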


Pipelining (cont)

• [4.32, 4.33, 4.44]

• interpreting instns in pipelined processor (Mic-4):

– new sub-cycles: microsteps

– takes 3 cycles to process instn (steps i, ii, iii from earlier)

– call latches A, B, C (like registers)

– advantage [4.33] is that different stages can work independently of one another now

• more stages in pipeline means higher efficiency


Pipelining (cont)

• One complication: memory reads

– takes 2 cycles to get word from memory

– hence a m.i. that uses a word in MDR must wait until it’s available

– called a true or RAW (read after write) dependence

– pipeline must stall until it is ready

– ideally, fill the wait states with other m.i.’s

• Another complication: conditional branches

– cannot predict which instn to fetch/put into pipeline

– have to “squash” or “flush” pipeline when a jump ruins sequence of instns


Pipelines and branch prediction

• unconditional branches

– fetch unit needs to know in advance where to access instns

– a jump instn. isn’t decoded right away, and so F.U. won’t know branch location until later: the instn position after the branch is called the delay slot

– soln: compiler places other executable instns in the delay slot, ones that it knows can be executed

• conditional branches

– dynamic prediction: carried out during run time

– keep a running table of branch instn addresses, along with a “branch/no branch” bit

– if branch in table, and branch bit set, then predict it will be taken --> fetch it

– can use 2 prediction bits: prediction changes only after it has been wrong twice in a row (extra logic)
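A 2-bit scheme of this kind is commonly built as a saturating counter per branch. The sketch below is one such version, not necessarily the one in the figures: counter values 0 and 1 predict "not taken", 2 and 3 predict "taken". The table is a plain dict keyed by branch address, and the initial counter value of 1 is an arbitrary assumption.

```python
class TwoBitPredictor:
    """Dynamic branch predictor with one 2-bit saturating counter per branch."""

    def __init__(self, initial=1):
        self.initial = initial
        self.counters = {}  # branch address -> counter in 0..3

    def predict(self, address):
        """Return True if the branch at this address is predicted taken."""
        return self.counters.get(address, self.initial) >= 2

    def update(self, address, taken):
        """Move the counter one step toward the actual outcome."""
        counter = self.counters.get(address, self.initial)
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
        self.counters[address] = counter


if __name__ == "__main__":
    predictor = TwoBitPredictor()
    outcomes = [True, True, True, False, True, True]  # a loop with one exit
    for taken in outcomes:
        guess = predictor.predict(0x40)
        print(f"predicted {guess}, actual {taken}")
        predictor.update(0x40, taken)
```

A single loop exit costs one misprediction but does not flip the prediction for the next pass through the loop.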


Pipelines and branch prediction

• static branch prediction: carried out during compile time

– if a loop is nearly always done, then have a field in the instn. which tells CPU that branch should be fetched (eg. UltraSPARC)

– can do simulations to determine how cond. branches executed


Improving performance: out-of-order exec, reg renaming

• instruction ops can take varying # clock cycles

– in superscalar systems, some functional units need more time to process their instns

• problem: can’t exec one instn that requires results of another

– means the pipeline stalls until register values are computed when subsequent instns require them.

• soln: reorder instructions, so that there is no idle waiting

– overall result must be identical to “linear” order

• dependencies:

– RAW (read after write): try to read reg before another instn has written it.

– WAR (write after read): try to write before another has read it

– WAW (write after write): try to write reg before an earlier instn has written it (final value would be wrong)
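The three dependence types can be detected mechanically from the registers each instruction reads and writes. In this sketch an instruction is reduced to a (destination, sources) pair, a hypothetical representation chosen for illustration.

```python
def hazards(first, second):
    """Classify dependences between two instructions in program order.

    Each instruction is a (dest, sources) pair: dest is a register name,
    sources is a set of register names.
    """
    dest1, sources1 = first
    dest2, sources2 = second
    found = set()
    if dest1 in sources2:
        found.add("RAW")  # second reads what first writes
    if dest2 in sources1:
        found.add("WAR")  # second overwrites what first still has to read
    if dest1 == dest2:
        found.add("WAW")  # both write the same register
    return found


if __name__ == "__main__":
    add = ("R1", {"R2", "R3"})  # R1 = R2 + R3
    sub = ("R4", {"R1", "R5"})  # R4 = R1 - R5
    mov = ("R2", {"R6"})        # R2 = R6
    print(hazards(add, sub))
    print(hazards(add, mov))
    print(hazards(add, ("R1", {"R7"})))
```

Only RAW is a true data dependence; WAR and WAW arise from reusing register names.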


In-order exec, in-order completion

– decode in cyc n, exec n+1, writeback n+2 (except multiply in n+3)

– 2 instns decoded simult.

– uses scoreboard: 1 counter per reg keeping track of # instns using it as a source or destination

– keeps track of max # regs that can be processed concurrently


Out-of-order exec, reg renaming (cont)

– idea: execute instns so long as resources are available, and no conflicts

– move order of instns to permit this

– registers are renamed automatically to reduce conflicts: “secret regs”

» eg. if a register is in conflict, rename it so conflict is removed.

» copy values to original named reg later if required.

– result: huge performance gain (we’re trying to make pipeline maximally useful!)
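One simple way to picture renaming: give every written value a fresh "secret" register and redirect later reads to the newest name. The sketch below does this for a straight-line list of (dest, sources) pairs. The S0, S1, ... names and the unlimited supply of secret registers are assumptions made for illustration; real hardware has a fixed pool and frees names as instructions retire.

```python
def rename(instructions):
    """Rename destinations so that WAR and WAW conflicts disappear.

    instructions: list of (dest, [sources]) in program order.
    Returns the same list with destinations replaced by fresh names and
    sources redirected to the latest name of each register.
    """
    latest = {}  # architectural register -> current secret register
    renamed = []
    for index, (dest, sources) in enumerate(instructions):
        new_sources = [latest.get(src, src) for src in sources]
        new_dest = f"S{index}"  # hypothetical secret register name
        latest[dest] = new_dest
        renamed.append((new_dest, new_sources))
    return renamed


if __name__ == "__main__":
    program = [
        ("R1", ["R2", "R3"]),  # R1 = R2 + R3
        ("R4", ["R1"]),        # R4 = R1      (RAW on R1: must be kept)
        ("R1", ["R5"]),        # R1 = R5      (WAW with 1st, WAR with 2nd)
    ]
    for dest, sources in rename(program):
        print(dest, "<-", sources)
```

After renaming, the third instruction no longer conflicts with the first two and can run early; only the true RAW dependence between the first two remains.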


Improving performance: speculative exec

• block: a section of sequential code [4.45]

• Can increase throughput by moving instructions beyond their blocks

– hoisting: moving an instruction over a branch

• speculative execution: executing an instruction before it is known whether it will be needed

– OK to do it so long as there is no side effect (eg. write to memory, trap/interrupt)

– may sometimes cause slowdown if spec. exec fetches an instn from memory that isn’t needed

– otherwise, idea is to move slower instructions up the queue so that their processing can occur in the interim

• some solns:

– speculative instns: only fetch/exec instructions that are in the cache

– poison bits: don’t set traps automatically; wait until that instn actually executed, and if a poison bit is set, then set the trap


Speculative exec


Example 1: Pentium II

• 1. Fetch/decode [4.46]

– fetches instns and breaks them into m.i.’s

• 2. dispatch/exec

– takes m.i.’s and execs them

• 3. retirement unit

– completes exec, stores reg values (speculative exec)

• 1, 2, 3 above act as high-level pipeline

• ROB (reorder buffer): table of m.i.’s to execute

• Fetch/decode [4.47]

– 7-stage pipeline

– multiple formats, sizes means instn decoding is involved

– analyzes instns to determine: size, branch-prediction

– usually between 1 and 4 m.i.’s per ISA instn.

– uses reg renaming

– both static, dynamic branch prediction used

• Dispatch/exec [4.48]

– 5 m.i.’s can be exec’d at once


P2-micro architecture


Example 2: UltraSPARC II

• [4.49]

• RISC: all instns are 3-register microinstns already

• branch prediction: (i) cache flags; (ii) 2-bit prediction; (iii) compiler directions in instns

• tries to exec 4 instns in parallel all the time

– instns may be executed out of order

• 9-stage pipeline [4.50]

– split integer, float pipelines

– int adds 2 stages (N1, N2) to keep it same as fp


UltraSPARC


UltraSPARC Pipeline


Example 3: picoJava II

• [4.51]

• instn, data caches are optional

• register file (64 entries)

– contains top 64 words of stack

– dribbling: reg file read/written to memory when it gets too empty/full

– “free” access, w/o accessing caches (which may not be used)


• 6-stage pipeline [4.52]

– CISC instns

– not superscalar: instns fetched, retired in order (unlike Pentium II)

• no branch prediction alg (economy)


Folding

• Folding [4.53, 4.54, 4.55]

– replace a set of m.i.’s with one m.i.

– looks up patterns in a table [4.55], and replaces with equivalent m.i.

– only possible if operands are high in stack, in register file

– huge gain in speed, like RISC performance


Comparing these examples

• common features

– all m.i.’s contain opcode, 2 source regs, dest reg

– 1 m.i. per cycle

– deep pipelines

– split instn and data caches

• Pentium II: complexity is in deconstructing its CISC instns into micro-operations

• JVM: complexity is in folding sets of m.i.’s into single operations

• UltraSPARC: most straightforward to implement, because instns require minimal decoding (all RISC instructions are micro-operations already!)


The end
