Exceptions and Interrupts

Computer Architecture - Superscalar Processors

When an exception (overflow, page-fault) occurs there are several instructions in the pipeline.

The offending instruction must be captured (in the EPC and Cause registers), earlier instructions must be completed, and later instructions must be flushed.

When there is an interrupt (context-switch, I/O signal) the processor has more freedom. It will usually “drain” the pipeline and then transfer control to the interrupt handler.

A new control line called EX.Flush zeros the control lines of the MEM and WB stages.



Exception Flushing

This style of exception and interrupt handling is called precise exceptions or precise interrupts: all instructions before the offending instruction are completed, and all instructions after the offending instruction are flushed.
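The flushing rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the function name take_exception, and the sample PC values are hypothetical, and the sketch records the offender's own address in EPC and squashes the offender together with the younger instructions.

```python
# Sketch of precise-exception handling in a 5-stage pipeline.
# The data layout and names below are assumptions made for illustration.
HANDLER_ADDRESS = 0x40000040  # exception address that appears in the datapath figure

STAGES = ["IF", "ID", "EX", "MEM", "WB"]  # youngest instruction first


def take_exception(pipeline, faulting_stage, cause):
    """Return the machine state after an exception raised in faulting_stage.

    pipeline maps each stage name to the PC of the instruction it holds.
    Instructions older than the offender (in later stages) are completed;
    the offender and everything younger are flushed.
    """
    idx = STAGES.index(faulting_stage)
    flushed = [pipeline[s] for s in STAGES[: idx + 1]]
    completed = [pipeline[s] for s in STAGES[idx + 1:]]
    return {
        "EPC": pipeline[faulting_stage],  # identifies the offending instruction
        "Cause": cause,                   # why the exception was raised
        "next_PC": HANDLER_ADDRESS,       # control transfers to the handler
        "flushed": flushed,
        "completed": completed,
    }


# Hypothetical snapshot: an add in EX overflows.
pipeline = {"IF": 0x50, "ID": 0x4C, "EX": 0x48, "MEM": 0x44, "WB": 0x40}
state = take_exception(pipeline, "EX", cause="overflow")
```

In this snapshot the instructions at 0x44 and 0x40 complete, while those at 0x48, 0x4C and 0x50 are flushed and re-fetched after the handler returns.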

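[Figure: pipelined datapath with exception support. It shows the PC, instruction memory, registers, sign extend, ALU, data memory, the IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers, the control, hazard detection and forwarding units, the IF.Flush, ID.Flush and EX.Flush control lines, the Cause and ExceptPC registers, and the exception address 40000040 as an input to the PC mux.]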



Goal: Reduce the Cycle Time

Execution Time (ET) = (# instructions)*(average CPI)*(cycle time)

Number of Instructions - Mainly dependent on the program itself and compiler technology.

Average CPI - Depends on the architecture of the processor. In a pipelined datapath CPI -> 1.0

Cycle Time - Is a physical trait of the chip. Is independent of the processor’s architecture. Right? Wrong!!!



Execution Time (ET) = (# instructions)*(average CPI)*(cycle time)

The execution time of a program is the product of the above 3 factors.

Number of Instructions - Mainly dependent on the program itself and compiler technology. Depends also on the architecture. A RISC processor results in more instructions.

Average CPI - Depends on the architecture of the processor. A pipelined datapath with forwarding, hazard detection, branch prediction, and an optimizing compiler that can reorder code can reduce the CPI to almost 1.0 . A RISC processor executes more instructions but usually has a lower CPI.

Cycle Time - The smaller the microprocessor is, the faster the clock can tick. This is a physical trait of the processor, so the cycle time is independent of the architecture of the chip. Right? Wrong!!!


Shorter Stages

In our original multiple-cycle datapath we assumed that the time to access memory or use the ALU is 2ns, and that accessing the RF takes 1ns.

Can this be shortened? Reading 16 bits from memory, or adding only two 16-bit numbers, should take less time.

Problem: Our processor uses 32-bit words.

Solution: Read from memory or perform addition in 2 cycles.

Problem: What’s the advantage?

Solution: Pipeline the IF, EX, and MEM stages.



Super-Pipelining

[Figure: pipeline timing diagrams for add $s0, $t0, $t1. Original 5-stage pipeline: IF ID EX MEM WB. New 8-stage pipeline: IF1 IF2 ID EX1 EX2 M1 M2 WB.]

Splitting pipeline stages is called super-pipelining. Here each instruction is executed in 8 stages.

The CPI is the same: 1 instruction every cycle. However, the cycle time is shorter.



Super-Pipelined Execution Time

A program executes 1,000,000 instructions. It is executed on a 5-stage pipeline with a cycle time of 2ns, and on an 8-stage super-pipeline with a cycle time of 1.3ns. Where does it execute faster? (Assume no hazards.)

5-stage: 1,000,000*1*2ns = 2,000,000ns = 2ms

8-stage: 1,000,000*1*1.3ns = 1,300,000ns = 1.3ms

Speedup: 2/1.3 = 1.54

Problems with super-pipelining:

Splitting stages isn’t that simple.

More instructions are flushed on branch mispredictions.
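The execution-time comparison can be checked with a short Python sketch of the ET formula. The function name execution_time_ns is a hypothetical helper, and the CPI of 1.0 reflects the "no hazards" assumption of the exercise.

```python
def execution_time_ns(instructions, cpi, cycle_time_ns):
    """ET = (# instructions) * (average CPI) * (cycle time), in nanoseconds."""
    return instructions * cpi * cycle_time_ns


# Numbers from the exercise; CPI = 1.0 because no hazards are assumed.
five_stage = execution_time_ns(1_000_000, 1.0, 2.0)    # 5-stage pipeline
eight_stage = execution_time_ns(1_000_000, 1.0, 1.3)   # 8-stage super-pipeline
speedup = five_stage / eight_stage
```

Any of the three factors can be varied independently here, which makes it easy to see that a deeper pipeline only pays off if its CPI does not rise enough to cancel the shorter cycle time.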



Goal: Lowering the CPI

Is a CPI of 1.0 the lowest CPI achievable?

addi $t2,$t0,4

sw $s0,0($t5)

subi $t3,$t1,-4

sw $s1,0($t7)

Every pair of instructions in the code above is independent.

What if we had 2 pipelines? One that performs R-types and branches and one that performs loads and stores?

A CPI of 0.5 is theoretically possible.
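As a rough illustration of the two-pipeline idea, the Python sketch below counts cycles for an idealized dual-issue machine. It is a simplification under stated assumptions: the function dual_issue_cycles is hypothetical, instructions are paired strictly in program order, and no data hazards are modeled.

```python
LOAD_STORE = {"lw", "sw"}  # opcodes handled by the load/store pipeline


def dual_issue_cycles(program):
    """Count cycles for an idealized two-issue pipeline (no hazards modeled).

    Each cycle may issue one ALU/branch instruction and one load/store,
    taken in program order.
    """
    cycles = 0
    i = 0
    while i < len(program):
        cycles += 1
        first_is_mem = program[i].split()[0] in LOAD_STORE
        # Pair with the next instruction only if it uses the other pipeline.
        if i + 1 < len(program):
            second_is_mem = program[i + 1].split()[0] in LOAD_STORE
            if first_is_mem != second_is_mem:
                i += 2
                continue
        i += 1
    return cycles


program = ["addi $t2,$t0,4", "sw $s0,0($t5)", "subi $t3,$t1,-4", "sw $s1,0($t7)"]
cycles = dual_issue_cycles(program)
cpi = cycles / len(program)
```

For the four instructions above the sketch finds two issue cycles, which is the theoretical CPI of 0.5. A program made only of ALU instructions would get no benefit, since both instructions of a pair would compete for the same pipeline.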



A Superscalar Pipeline

A processor that can fetch more than one instruction in a cycle is called a superscalar or multiple-issue processor.

[Figure: superscalar pipeline timing diagram (program execution order, in instructions, against time) for the instruction pairs addi $t2,$t0,4 / sw $s0,0($t5) and subi $t3,$t1,-4 / sw $s1,0($t7).]



A scalar is a single data value. A vector is an array of data values. Supercomputers of the 70s and 80s (like the Cray X-MP) could perform operations on scalars or on vectors.

The code: for(i=0;i<256;i++) c[i]=a[i]*b[i]; could be performed as a vector operation, with one instruction operating on many elements. A “regular” operation (c=a*b;) is called a scalar operation. The early Crays could issue 1 scalar or 1 vector operation per cycle.

Processors that could perform more than 1 scalar operation per cycle were dubbed super-scalar processors.

Multiple-issue means that multiple instructions are issued to the processor per cycle.


Superscalar Datapath

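[Figure: superscalar datapath. It shows the PC, instruction memory, registers, two sign extenders, the main ALU, a second ALU that computes the address for the data memory (Address, Write data), and the exception address 40000040 at the PC mux.]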



In order to implement the superscalar datapath we have to add the following capabilities:

Another read port for the instruction memory, since two instructions are fetched each cycle.

Another 2 read ports and 1 write port for the register file.

An additional ALU for effective address calculation.

An additional sign-extender.

Of course the real world isn’t a perfect world. Not all instructions will arrive paired off as expected.

The IF stage must detect when two instructions can’t be issued each cycle and stall one of them. Or it can flip their order if the Load/Store instruction is the first of the pair.

In the drawing on the previous page there is a mistake. Because two 4-byte instructions are fetched each cycle, 8 (not 4) must be added to the current PC.

We now have a new problem. Instructions using the value of a load must stall one cycle. Now the next two instructions can’t use the loaded value without stalling.


Superscalar Code Scheduling

Loop: lw   $t0,0($s1)       #$t0=array element
      addu $t0,$t0,$s2      #$t0=$t0+$s2
      sw   $t0,0($s1)       #store result
      addi $s1,$s1,-4       #decrement pointer
      bne  $s1,$zero,Loop   #branch if $s1!=0

Reorder the code to avoid pipeline stalls in a superscalar processor:

      ALU or branch          Load or store         Cycle
Loop:                        lw   $t0,0($s1)       1
      addi $s1,$s1,-4                              2
      addu $t0,$t0,$s2                             3
      bne  $s1,$zero,Loop    sw   $t0,4($s1)       4

The store uses offset 4 because addi has already decremented $s1 by the time the store executes.

4 clock cycles to execute 5 instructions: CPI = 0.8, not 0.5



Loop Unrolling

In loop unrolling, multiple copies of the loop body are made:

      ALU or branch          Load or store         Cycle
Loop: addi $s1,$s1,-16       lw   $t0,0($s1)       1
                             lw   $t1,12($s1)      2
      addu $t0,$t0,$s2       lw   $t2,8($s1)       3
      addu $t1,$t1,$s2       lw   $t3,4($s1)       4
      addu $t2,$t2,$s2       sw   $t0,16($s1)      5
      addu $t3,$t3,$s2       sw   $t1,12($s1)      6
                             sw   $t2,8($s1)       7
      bne  $s1,$zero,Loop    sw   $t3,4($s1)       8

The first lw issues together with the addi and still sees the old $s1, so its element is stored back at offset 16 from the updated pointer.

The overhead of the loop is reduced and more instructions can be scheduled in parallel. 14 instructions in 8 cycles: CPI = 8/14 = 0.57
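What unrolling does to the loop itself can be shown with a Python analogue. This is a sketch, not a translation of the MIPS code: the function names are hypothetical, the array is indexed by element rather than by byte address, and the unrolled version assumes the length is a multiple of 4.

```python
def add_scalar(array, s2):
    """One element per iteration, like the original five-instruction loop."""
    i = len(array) - 1
    while i >= 0:
        array[i] += s2
        i -= 1          # loop overhead is paid once per element
    return array


def add_scalar_unrolled4(array, s2):
    """Four elements per iteration; assumes len(array) is a multiple of 4."""
    i = len(array) - 4
    while i >= 0:
        array[i + 3] += s2
        array[i + 2] += s2
        array[i + 1] += s2
        array[i] += s2
        i -= 4          # loop overhead is paid once per four elements
    return array


# CPI figures for the two schedules discussed in the slides.
cpi_scheduled = 4 / 5     # 4 cycles for 5 instructions
cpi_unrolled = 8 / 14     # 8 cycles for 14 instructions
```

The four updates inside the unrolled body are independent of each other, which is what gives the scheduler enough work to fill both issue slots in most cycles.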



In Order Execution

lw $t0,20($s2)

addu $t1,$t0,$t2

sub $s4,$s4,$t3

slti $t5,$s4,20

The above code is executed in program order. A cache miss on the load will stall the pipeline until the data is read from memory.

But the 3rd and 4th instructions are independent of the first two. Why wait? Let’s execute them.

This is called Dynamic Pipelining or Out-Of-Order Execution (OOO Execution).
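The independence argument can be made concrete with a small dependence check in Python. The sketch is illustrative only: the tuple encoding (opcode, destination, sources) and the function blocked_by are hypothetical, and only true data dependences through registers are tracked.

```python
def blocked_by(program, stalled_index):
    """Return indices of instructions that depend, directly or transitively,
    on the result of the stalled instruction (true data dependences only)."""
    tainted = {program[stalled_index][1]}   # destination of the stalled load
    blocked = []
    for idx in range(stalled_index + 1, len(program)):
        _, dest, sources = program[idx]
        if tainted & set(sources):
            blocked.append(idx)
            tainted.add(dest)               # its result is late as well
        else:
            # dest now holds a value that does not depend on the load
            tainted.discard(dest)
    return blocked


# (opcode, destination, source registers) for the four instructions above
program = [
    ("lw",   "$t0", ["$s2"]),
    ("addu", "$t1", ["$t0", "$t2"]),
    ("sub",  "$s4", ["$s4", "$t3"]),
    ("slti", "$t5", ["$s4"]),
]
blocked = blocked_by(program, 0)
free_to_run = [i for i in range(1, len(program)) if i not in blocked]
```

Only the addu is held up by the load. The sub and the slti (which depends on the sub, not on the load) are free to execute while the cache miss is serviced.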



In order to implement Out-Of-Order (OOO) execution, the instructions that are stalled must “wait” somewhere. If not, they will be overwritten by the following instructions.

The IBM/Motorola PowerPC family and the Intel Pentium Pro family of processors use Reservation Stations. Instructions are decoded and sent to the reservation station of the Functional Unit (FU) that will execute them:

ALU - executes most integer operations.

Integer multiplier/divider - some processors have separate units.

Memory Access Unit (MAU) - loads/stores from memory, usually has its own ALU for EA computation.

Branch Unit (BU) - checks conditions and computes the target PC.

FP add unit - performs FP additions, subtractions, negations …

FP multiplier/divider/sqrt - some processors have separate units.

If the operands of the instruction are ready and the unit is free (no other instruction is using it) the instruction is executed.

If not it waits until data and structural hazards are resolved.


Reservation Stations

Several instructions (2-4) are fetched every cycle.

The MEM stage is now bypassed by most instructions.

[Figure: dynamically scheduled pipeline. An instruction fetch and decode unit issues in order (IF, ID, Issue) to reservation stations in front of the functional units (Integer, Integer, …, Floating point, Load/Store), which execute out of order (EX/MEM); a commit unit commits in order (WB).]



OOO Execution, In-Order Commit

Allowing instructions to commit out-of-order results in imprecise interrupts.

An interrupted program resumes execution from the PC that was current when the interrupt occurred. Instructions that were in the pipeline are re-executed, so what?

It is possible that a later instruction has already executed and written a value to the RF or memory. The program is now re-executed with updated register values. This can cause the program to execute incorrectly.


OOO Execution, In-Order Commit

lw $t0,0($s0) # the word at 0($s0) holds 100

addi $t1,$t1,1 # $t1 holds 50

add $t2,$t0,$t1 # $t2 = 151

addi executes before the load. The load causes a page-fault, so an interrupt occurs. But addi has already committed (WB stage), so $t1 contains 51.

When returning from the interrupt the code will be re-executed. $t2 will contain 152.

This is called an imprecise interrupt. To avoid this, commit is done in program order. Instructions wait in the commit buffer until their “turn” arrives.
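The effect of the early commit can be reproduced with a tiny Python model of the three instructions. It is a sketch under stated assumptions: the function run, the address 0x1000 and the dictionary-based register file are hypothetical, with the loaded word equal to 100 and the register incremented by addi starting at 50.

```python
def run(regs, memory, commit_addi_before_fault):
    """Run the three-instruction example with one page fault on the load."""
    regs = dict(regs)  # work on a copy so both scenarios start identically
    if commit_addi_before_fault:
        # Out-of-order commit: addi writes $t1 before the load faults.
        regs["$t1"] += 1
    # The load faults; the handler runs and the code restarts from the load.
    regs["$t0"] = memory[regs["$s0"]]            # lw   $t0,0($s0)
    regs["$t1"] += 1                             # addi $t1,$t1,1
    regs["$t2"] = regs["$t0"] + regs["$t1"]      # add  $t2,$t0,$t1
    return regs["$t2"]


regs = {"$s0": 0x1000, "$t0": 0, "$t1": 50, "$t2": 0}  # 0x1000 is a made-up address
memory = {0x1000: 100}

precise = run(regs, memory, commit_addi_before_fault=False)
imprecise = run(regs, memory, commit_addi_before_fault=True)
```

With in-order commit the restart sees the original register state and produces 151. When addi has already committed, it is applied twice and the result becomes 152.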



Problems with In-Order Commit

In-order commit might be slower than OOO commit but is safer and is used by all modern processors.

div $1,$2,$3

add $2,$3,$4

sub $5,$3,$7

addi $3,$2,100

The instructions following div don’t depend on $1.

add can’t change $2 until div reads it. They can’t be issued in the same cycle (same problem for $3).

Where does add save the result in $2?

Where are temporary results saved?



OOO Execution, In-Order Commit

The 32 registers of the ISA (Instruction Set Architecture) are called the logical registers. They are updated only in-order at the commit stage. In the case of an exception or interrupt they hold the state of the program.

Instructions operate on physical registers. The MIPS R10000 has 64 physical registers. During decode each logical register is mapped to a physical one.

Thus several instructions can hold the same value of a logical register without worrying that it will be updated before it is used.

Instructions write to physical registers at the execute stage, thus the results are visible to instructions in the next cycle.

At the commit stage the physical register is written into the logical register.

Of course it is still impossible for instructions with dependencies to execute in the same cycle.


Logical and Physical Registers

The 32 registers of the ISA are called logical registers. They are mapped into physical registers.

div $p1,$p20,$p30 # cycle 1

add $p21,$p31,$p4 # cycle 1

sub $p5,$p32,$p7 # cycle 1 or 2

addi $p33,$p21,100 # cycle 2
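One common way to produce such a mapping is a rename table that gives every result a fresh physical register. The Python sketch below is an assumption-laden illustration, not the R10000's mechanism: the function rename and the free-list numbering are hypothetical, the physical register numbers it produces differ from the ones on the slide, and all readers of an unmodified logical register share a single physical register, whereas the slide gives each reader of $3 its own copy.

```python
from itertools import count


def rename(program, first_free=1):
    """Rename every destination to a fresh physical register.

    program is a list of (op, dest, sources). String sources are logical
    registers such as "$2"; anything else is treated as an immediate.
    """
    mapping = {}                 # logical register -> current physical register
    free = count(first_free)     # hypothetical free list: p1, p2, p3, ...
    renamed = []
    for op, dest, sources in program:
        phys_sources = []
        for src in sources:
            if isinstance(src, str):            # a logical register
                if src not in mapping:          # first use: bind its current value
                    mapping[src] = f"$p{next(free)}"
                phys_sources.append(mapping[src])
            else:                               # an immediate operand
                phys_sources.append(src)
        # Sources are looked up before the destination is remapped.
        mapping[dest] = f"$p{next(free)}"       # fresh register for every result
        renamed.append((op, mapping[dest], phys_sources))
    return renamed


program = [
    ("div",  "$1", ["$2", "$3"]),
    ("add",  "$2", ["$3", "$4"]),
    ("sub",  "$5", ["$3", "$7"]),
    ("addi", "$3", ["$2", 100]),
]
renamed = rename(program)
```

After renaming, add writes a register that div never reads, so the two no longer conflict over $2, and addi reads exactly the register that add wrote. The true dependence of addi on add remains, which is why addi still waits for a later cycle.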



PPC and Pentium III Diagram

[Figure: block diagram of a dynamically scheduled processor. It shows the PC with branch prediction, the instruction cache, the instruction queue, the decode/dispatch unit, reservation stations in front of the branch, integer, integer, complex integer, floating point and load/store units, the data cache, the register file, the reorder buffer, and the commit unit.]



OOO Execution, In-Order Commit

The MIPS R10000 doesn’t use reservation stations. It uses structures called Instruction Queues.

An instruction waits in the Integer, Address, or Floating Point queue until its dependencies are satisfied (the operand values are obtained) and an FU is available.

The advantage is that there is more room in the queues for more instructions. The R10000 queues contain 16 instructions each. The PPC reservation stations contain 2 instructions each.

Another advantage is that an instruction can be executed on any FU that becomes free (in the case of several FUs of the same type). An instruction in a reservation station is “stuck” to that unit.

The disadvantage is that a queue may become a bottleneck and that the control is centralized. The control of reservation stations is distributed.


MIPS R10K
