2. Pipelined and superscalar processors
Chris Jesshope

Pipelined processors


Two types of concurrency

♣  Pure parallel
  In a pure-parallel system multiple operations are performed simultaneously
♣  Pipelined parallel
  In a pipelined-parallel system an operation is broken down into a sequence of sub-tasks
  The sub-tasks are interleaved in time, i.e. multiple sub-tasks from different operations are performed simultaneously
  This has already been illustrated in DRAM access
♣  Parallelism is the only weapon we have to increase throughput and to overcome long latencies in system design


Pipelines

Operations are decomposed into sequential sub-tasks: Op: A ; B ; C ; D

Pipelined concurrency requires sequences of operations. Given Op_i, 0 ≤ i ≤ n, we execute sub-tasks from different operations concurrently as follows:

  Op_i: A_i+3 || B_i+2 || C_i+1 || D_i

where ; denotes sequential composition and || denotes parallel composition.

We need to capture the intermediate results, and there is a start-up phase and a flush period during which we do not have full concurrency.
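As an illustrative sketch (not from the slides), a few lines of Python print which sub-tasks of which operations occupy the four stages in each cycle, making the start-up and flush phases visible:

    # Minimal sketch of 4-stage pipelined execution of n operations.
    # In each cycle, stage s works on operation (cycle - s) when that index is valid.
    STAGES = ["A", "B", "C", "D"]
    n = 5  # number of operations (assumed example value)

    for cycle in range(n + len(STAGES) - 1):
        active = [f"{name}{cycle - s}"
                  for s, name in enumerate(STAGES)
                  if 0 <= cycle - s < n]
        print(f"cycle {cycle}: " + " || ".join(active))
    # The first and last few cycles show fewer than four concurrent sub-tasks.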


Pipeline timing diagram

[Figure: pipeline timing diagram - operations 1-5 flow through stages A, B, C and D, with time on the horizontal axis and concurrency on the vertical axis.]

Only when the pipeline is full is there maximal concurrency!


Pipeline performance

♣  This pipeline has a length of 4 sub-tasks; assume each sub-task takes t seconds
  for a single operation we get no speedup; it takes 4t seconds to complete the four sub-tasks
  this is the same as performing each sub-task in sequence on the same hardware
♣  In the general case - for n operations - it takes 4t seconds to produce the first result and t seconds for each subsequent result


Generic Pipeline Performance

♣  For a pipeline of length L, the time it takes to process n operations is as follows:

  T(n) = Lt + (n−1)t = (L−1)t + nt

♣  We can characterise all pipelines by two parameters: the start-up time and the maximum rate, i.e. the rate at which operations are completed once the pipeline is full
♣  Another useful parameter measures the number of operations required to produce half the maximum rate - the half-performance vector length


Pipeline Parameters

♣  Maximum rate, r∞ = 1/t (called "r infinity")
♣  Pipeline start-up time, S = (L−1)t
♣  Half-performance vector length, n1/2: this is the n at which the achieved rate is half the maximum, so T(n1/2) = 2·n1/2/r∞:

  (L−1)t + n1/2·t = 2·n1/2/r∞ = 2·n1/2·t

  n1/2 = L−1 = S/t

♣  It can be seen that n1/2 is equal to the start-up time in units of the clock period
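The following short Python sketch (illustrative only; L, t and n are assumed example values, not from the slides) evaluates these formulas and shows the rate approaching r∞ as n grows:

    # Pipeline performance model: T(n) = (L-1)t + nt.
    def pipeline_time(n, L, t):
        """Time to complete n operations on a pipeline of length L, stage time t."""
        return (L - 1) * t + n * t

    L, t = 4, 1e-9           # 4 stages, 1 ns per stage (assumed example values)
    r_inf = 1 / t            # maximum rate, achieved once the pipeline is full
    S = (L - 1) * t          # start-up time
    n_half = L - 1           # half-performance vector length, S/t

    for n in (1, n_half, 100):
        rate = n / pipeline_time(n, L, t)
        print(f"n={n:4d}: T={pipeline_time(n, L, t):.1e}s, rate={rate / r_inf:.0%} of r_inf")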


Asymptotic performance

[Figure: performance (rate of completed operations) against the number of operations - the curve rises asymptotically towards r∞ = 1/t, passing through r∞/2 at n = n1/2.]


Pipelined computers

♣  Pipelining can be applied to processing sequences of instructions
♣  Or to processing sequences of complex operations
  The former is used in scalar and superscalar computers
  Vector computers execute multiple operations using one instruction


Instruction execution

♣  To exploit pipelining in instruction processing, the instruction set must be regular and simple
♣  Each operation must be similar
  E.g. RISC (Reduced Instruction Set Computer); see: http://www.science.uva.nl/~jesshope/web-site-08236/home.html
  Complex instruction set computers can exploit pipelining, but the complex instructions must be broken down into RISC-like micro-instructions first
  ♥  E.g. Intel x86


A simple pipeline

♣  We can identify a number of sequential tasks (stages) in the execution of an instruction, leading to pipelined implementations - here is one example:
  Instruction fetch (IF)
  Instruction decode & register read (RR)
  Calculate: result, memory address or branch taken (EX)
  Access memory (MEM)
  Write back result (WB)
♣  These become the concurrent stages in the pipeline
♣  Note that it requires two memories to read an instruction and read or write data concurrently (IF & MEM stages)
  Hence the requirement for separate caches at level 1


Simplified Pipelined Datapath - MIPS ISA

[Figure: the five-stage MIPS datapath - PC and instruction memory, register file (read register 1/2, write register, read data 1/2), sign extension (16 to 32 bits), ALU and data memory - separated by the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB, with the opcode etc. fed to control.]


Control signals also pipelined

[Figure: control signals generated at decode are carried forward through the ID/EX, EX/MEM and MEM/WB pipeline registers, with the EX, M and WB control fields consumed at the corresponding stages.]


Pipeline performance

♣  Long instruction sequences mean it should be possible to execute instructions at approaching one instruction per cycle in a pipelined datapath
♣  There are problems in achieving this - they are:
  hazards, e.g. data dependencies between instructions or branches in the instruction stream
  long-latency operations, e.g. memory references or floating-point operations
♣  Approaches used to improve instruction execution rate:
  using multiple functional units
  issuing multiple instructions concurrently
  executing instructions out of programmed order


Measuring performance

♣  Performance depends on the pipeline’s clock rate
♣  And on the number of cycles required per instruction, CPI
  or its inverse, the number of instructions executed per cycle, IPC
♣  The product of IPC and clock rate gives a measure of a processor’s performance
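As a hedged worked example (the numbers are assumed, not from the slides), a processor clocked at 3 GHz sustaining an IPC of 2 executes some 6×10⁹ instructions per second:

    # Performance = IPC x clock rate (assumed example values).
    clock_hz = 3.0e9    # 3 GHz pipeline clock
    ipc = 2.0           # sustained instructions per cycle
    print(f"{ipc * clock_hz:.1e} instructions/second")   # 6.0e+09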


Pipeline performance

♣  Here we have some interesting observations:
  the processor’s clock is slower the more complex the task that must be completed in a single period
  …but operations have a fixed complexity
  so to increase the clock rate, instructions are broken down into smaller sub-tasks
  smaller tasks mean deeper pipelines, so more instructions need to be executed to fill the pipeline


Multiple instruction issue


Multiple execution units

♣  In the simple pipeline, register-to-register operations have a wasted cycle
  a memory access is not required, but this stage still requires a cycle to complete
♣  Decoupling memory access from operation execution avoids this, e.g. use an ALU unit plus a memory unit
  …but then either we need two write ports to the register file or arbitration on a single port

[Figure: two parallel pipelines sharing IF and ID - IF, ID, Ex, WB performs R-to-R ops; IF, ID, Ad, M, WB performs loads and stores.]


Pipeline hazards

♣  There are three types of pipeline hazard that must be identified and resolved:
  structural hazard - two instructions require the same resource (e.g. a register file port)
  control hazard - e.g. branches of control
  data hazard - a read-after-write dependency
♣  Each hazard, when encountered, will cause the pipeline to stall until the hazard is resolved
  a stall means one or more stages are blocked and cannot accept more instructions


Structural hazard - register

♣  A structural hazard occurs when a resource in the pipeline is required by more than one instruction, e.g. one write port
  a resource may be an execution unit or a register port
  a multi-function pipeline has a hazard writing registers

      lw  a addr
      add b c d

[Figure: timing of lw (IF, ID, Ad, M, WB) and add (IF, ID, Ex, WB) - both reach WB in the same cycle, a hazard on the register write port; it is resolved by stalling the pipeline, inserting a stall or bubble before the add writes back.]


Structural hazard - execution unit

♣  Some operations require more than one pipeline cycle
  mult is more complex than add (often requires 2 cycles)
  floating point is more complex still (~5 cycles)

      ♥  mult c d e
      ♥  add f g h

[Figure: mult occupies the execution stage for two cycles (IF, ID, Ex, Ex, WB), so the following add stalls for one cycle - a bubble - before entering Ex. Resolved again by stalling the pipeline.]


Eliminating structural hazards

♣  Because structural hazards result from contention for hardware resources, they can only be removed by duplicating resources
  the register file hazard can be resolved by adding an additional write port to the register file, so that two writes can occur simultaneously
  in the case of the execution unit, the hazard can be solved by using dedicated functional units or by pipelining the functional unit
  we can have memory address, integer logical or add, integer multiplication and floating point as separate function units


Multiple functional units

♣  This is not a new principle
  the CDC 6600 (1963) & CDC 7600 (1970) had 10 & 9 functional units respectively
  the 7600’s operations were pipelined (except division), i.e. they could accept new operands every cycle
♣  These computers were scalar RISC supercomputers designed before the term RISC had been invented

[Figure: a pipeline with shared IF and ID stages issuing to four functional units of different depths - Add (one stage), memory (Adr, Mem), multiply (three stages) and float (five stages), each ending in WB.]

The goal was to achieve as close as possible to one instruction executed per cycle: IPC = 1


Control hazard

♣  Branches - in particular conditional branches - can also cause pipeline hazards
  the outcome of a conditional branch is not known until the Ex stage, but it is required at IF to load another instruction and keep the pipeline full
  a simple solution is to assume by default that the branch falls through - i.e. is not taken

[Figure: if the branch is not taken, the following instruction continues to fetch but stalls in ID until the branch target is known - one cycle lost; if the branch is taken, the wrongly fetched instruction is discarded and fetch restarts at the new target - two cycles lost.]


Eliminating control hazards

♣  Control hazards are more difficult to eliminate as they are procedural rather than structural
♣  Options include…
  embedding a delay in the action of a branch - the branch delay slot
  predicting the branch target, i.e. whether the branch is taken
  ♥  statically - the compiler decides
  ♥  dynamically - a huge research topic discussed later
  fetching from both targets - requires a branch target address prediction
  eliminating branches through predication


Branch delay slot

♣  Assume the branch takes effect on the second cycle after executing the branch rather than the next
  this eliminates the stall if the branch is taken, by filling the stalled cycle with an independent instruction

  1 cycle stall on all but the last instruction:

      L1: lw a x[i]
          add a a a
          sw a x[i]
          sub i i 4
          beq i 0 L2
          j L1
      L2: ---

  No stalls, but the last iteration executes an unnecessary sub (the branch delay slot is always executed):

      L1: lw a x[i]
          add a a a
          sw a x[i]
          beq i 1 L2
          sub i i 4
          j L1
      L2: ---

  n.b. add b c d ==> (b = c + d)


Fetching both targets

♣  Using an additional I-cache port, both the fall-through and branch-taken addresses can be fetched
  then at EX a choice is made as to which is decoded
  when coupled with a branch delay slot this eliminates all wasted cycles, but…
♣  multiple conditional branches in the pipeline will break this solution
  longer pipelines - say 20 stages - might contain several branches in the pipe prior to EX
  every new branch doubles the number of paths fetched


Predicated instruction execution

♣  Control flow can (in some cases) be replaced by guarded or predicated instruction execution…
  a condition sets a predicate register (boolean)
  instructions are predicated on that register
  any state change (WB or Mem write) only occurs if the predicate is true
  this moves the conditional operation from the IF stage to the WB stage
  useful in long pipelines where branch hazards can be costly
  it removes a control hazard at the expense of redundant instruction execution
  (a sketch of the idea follows below)
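A minimal sketch of the idea in Python (illustrative only - this is not an ISA definition): every instruction executes, but its write-back is gated by a predicate bit, so no branch is needed:

    # Predicated-execution sketch: both arms of a conditional execute,
    # but only writes whose predicate is true are allowed to change state.
    regs = {"a": 0, "b": 7, "c": 3}

    C = regs["b"] > regs["c"]                 # a condition sets the predicate register

    def predicated_write(pred, dst, value):
        """Write back only if the predicate is true; otherwise discard the result."""
        if pred:
            regs[dst] = value

    predicated_write(C, "a", regs["b"])       # instruction predicated on C
    predicated_write(not C, "a", regs["c"])   # instruction predicated on !C
    print(regs["a"])                          # 7 - only the C arm committed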


Predication - an example

[Figure: branching code vs predicated code - in the branching version, instruction i+1 is a conditional branch selecting between the false path (i+2, i+3, then a branch to i+7) and the true path (i+5, i+6); in the predicated version, instruction i+1 sets predicate C, instructions i+2 and i+3 are predicated on !C, i+5 and i+6 on C, and all paths fall through to i+7. No hazards, but redundant execution.]


Data hazard

♣  Data hazards are caused when the output of one instruction is required as an operand by a subsequent one
  this is called a data dependency
♣  The hazard occurs because of the latency in the pipeline
  the result (output of one instruction) is not written back to the register file until the last stage of the pipe
  the operand (input of a subsequent instruction) is required at register read - some cycles prior to write-back
  the longer the RR-to-WB delay, the more cycles the pipe must be stalled before the subsequent instruction can execute


Eliminating data hazards

      mult a b c
      add d a f

[Figure: the add depends on the mult (n.b. register read is in the ID stage), so two bubbles are inserted between them.]

In a data hazard (RAW) the data from EX is required immediately, but it takes two cycles to write it to the register file and read it again from the same register. The solution is to take the data directly from the pipeline register using a so-called bypass bus.

[Figure: with the data bypassed from the pipeline register to the execution unit, the dependent add follows the mult with no bubbles.]


Bypass buses

The operand is taken from the pipeline register and input directly to the ALU on the subsequent cycle. Because there are two stall cycles, a dependency on the next-but-one instruction also requires bypassing.

[Figure: the register file feeding the ALU, with bypass buses returning results from the pipeline registers at one, two and three-plus cycles of delay.]
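A hedged Python sketch of the forwarding decision (the register names and stage records are assumptions for illustration): an operand is taken from the newest pipeline register whose destination matches, falling back to the register file:

    # Bypass/forwarding sketch: take an operand from the most recent
    # in-flight result whose destination matches, else from the register file.
    regfile = {"r1": 10, "r2": 20, "r3": 0}

    ex_mem = {"dst": "r3", "value": 30}   # result one cycle before write-back
    mem_wb = {"dst": "r1", "value": 99}   # result about to be written back

    def read_operand(reg):
        """Check EX/MEM first, then MEM/WB, before reading the register file."""
        for latch in (ex_mem, mem_wb):
            if latch is not None and latch["dst"] == reg:
                return latch["value"]
        return regfile[reg]

    print(read_operand("r3"))  # 30 - bypassed from the EX/MEM pipeline register
    print(read_operand("r2"))  # 20 - nothing in flight, read the register file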


Data dependencies

♣  Data dependencies can also be eliminated by appropriate scheduling, if independent operations can be rearranged
  scheduling in this context is the process of sequencing the operations according to the data-dependency graph
  instructions are free to be scheduled in any order provided that all dependencies are met - concurrency of operation
♣  Scheduling can be dynamic or static
  static scheduling is performed by the compiler
  dynamic scheduling is performed by hardware
♣  Scheduling is generally an NP-hard problem


Example of a dependency graph

  a := (sqrt(b*b − 4ac)) / 2a

[Figure: dependency graph for the expression - the products b*b, 4*a*c and 2*a can execute in any order; the subtraction, sqrt and division can not, since each depends on the previous result.]

Superscalar processors

Dynamic dependency analysis


Pipeline design space issues

♣  Depth of pipeline - superpipelining
  further dividing pipeline stages increases frequency
  but introduces more scope for hazards
  and higher frequency means more power dissipated
♣  Number of functional units - scalar ILP
  instructions fetched and decoded in sequence
  multiple operations executed in parallel
  avoids waiting for long operations to complete
♣  Concurrent issue of instructions - superscalar ILP
  multiple instructions fetched and decoded concurrently
  issued to multiple functional units


Scalar vs superscalar

♣  Scalar processors issue instructions in order, but instructions may complete out of order
♣  Superscalar processors issue instructions concurrently and possibly out of programmed order


Superscalar processors

♣  Most modern microprocessors are superscalar
♣  Issues covered in this section are:
  parallel fetch, decoding and issue
  ♥  100s of instructions in flight simultaneously
  out-of-order execution and sequential consistency
  ♥  exceptions and false dependencies
  finding parallelism and scheduling its execution
  application-specific engines
  ♥  e.g. SIMD & prefetching


Principles

♣  Superscalar example based on a 3-stage pipeline, i.e. fetch (f), decode (d) and execute (e) multiple instructions simultaneously
  here we see a 3-stage pipeline with an issue width of 1 (scalar) and 2 (superscalar)

[Figure: timing of six instructions through f, d and e - the scalar processor issues one instruction per cycle (max IPC = 1); the superscalar processor issues two per cycle (max IPC = 2).]


Instruction-level parallelism - ILP

♣  ILP is the number of instructions issued per cycle (issue concurrency)
♣  IPC, the number of instructions executed per cycle, is limited by:
  the ILP
  the number of true dependencies
  the number of branches in relation to other instructions
  the latency of operations in conjunction with dependencies
♣  Current microprocessors issue from 4-8 instructions to up to 12 functional units concurrently, but achieve an IPC of typically 2-3

[Figure: two timelines - a long execution time with a resource dependency, where a pipelined functional unit still accepts a new instruction every cycle (e1-e4 overlapped); and a long execution time with a true data dependency, where the dependent instruction cannot begin executing until the four-cycle operation completes.]


Instruction issue policy

♣  Just widening the processor’s pipeline does not necessarily improve its performance
♣  The processor’s policy in fetching, decoding and executing instructions also has a significant effect on its performance
♣  The instruction issue policy is determined by its look-ahead capability in the instruction stream
  for example, with no look-ahead, if a resource conflict halts instruction fetching, the processor is not able to find any further instructions until the conflict is resolved
  if the processor is able to continue fetching instructions, it may find an independent instruction that can be executed on a free resource out of programmed order
♣  This is a decoupling of instruction fetching from instruction execution, and it is the basis for most modern microprocessors


1. In-order issue with in-order completion

♣  The simplest instruction issue policy
  instructions are issued in exact program order, with results written in the same order
  this is shown for comparison purposes only, as very few pipelines use in-order completion


1. In-order issue with in-order completion

♣  Assume a 3-stage execution in a pipeline that can issue two instructions, execute three instructions and write back two results every cycle… assume:
  I1 requires 2 cycles to execute
  I3 and I4 are in conflict for a functional unit
  I5 depends on the value produced by I4
  I5 and I6 are in conflict for a functional unit

[Figure: cycle-by-cycle table of the decode, execute and writeback stages for I1-I6 under in-order issue with in-order completion.]

6 instructions require 8 cycles: IPC = 6/8 = 0.75


2. In-order issue with out-of-order completion

♣  Out-of-order completion improves the performance of instructions with long-latency operations such as loads and floating-point ops
♣  The modifications made to execution are:
  any number of instructions is allowed in the execution stage, up to the total number of pipeline stages across all functional units
  instruction issue is not stalled when an instruction takes more than one cycle to complete


2. In-order issue with out-of-order completion

♣  Again assume a processor issues two instructions, executes three instructions and writes back two results every cycle
  I1 requires 2 cycles to execute
  I3 and I4 are in conflict for a functional unit
  I5 depends on the value produced by I4
  I5 and I6 are in conflict for a functional unit

[Figure: cycle-by-cycle table of the decode, execute and writeback stages for I1-I6 under in-order issue with out-of-order completion.]

6 instructions require 7 cycles: IPC = 6/7 = 0.86


2. In-order issue with out-of-order completion

♣  In a processor with out-of-order completion, instruction issue is stalled when:
  there is a conflict for a functional unit
  an instruction depends on a result that is not yet computed - a dependency
  ♥  register specifiers can be used to detect dependencies between instructions, with logic to ensure synchronisation between producer and consumer instructions - e.g. scoreboard logic
♣  There is one other situation when issue must be stalled… it is a new type of dependency caused by out-of-order completion
  it is called an output dependency


Output Dependency

♣  Consider the following code:

      R3 := R3 op R5
      R4 := R3 + 1
      R3 := R5 + 1
      R7 := R3 op R4

  the 1st instruction must complete before the 3rd, otherwise the 4th instruction may receive the wrong result!
  this is a new type of dependency caused by allowing out-of-order completion
  the result of the 3rd instruction has an output dependency on the 1st instruction
  the 3rd instruction must be stalled if its result may be overwritten by a previous instruction which takes longer to complete


3. Out-of-order issue with out-of-order completion

♣  In-order issue stalls when the decoded instruction has:
  a resource conflict, a true data dependency or an output dependency on an uncompleted instruction
  this is true even if instructions after the stalled one could execute
  to avoid stalling, decode must be decoupled from execution
♣  Conceptually, out-of-order issue decouples the decode/issue stage from instruction execution
  it requires an instruction window between the decode and execute stages to buffer decoded or partly pre-decoded instructions
  this buffer serves as a pool of instructions, giving the processor a look-ahead facility the size of the buffer (n.b. in this simple pipeline it is not a pipeline stage)
  instructions are issued from the buffer in any order, provided there are no resource conflicts or dependencies with executing instructions


3. Out-of-order issue with out-of-order completion

♣  Again assume a processor issues two instructions, executes three instructions and writes back two results every cycle, but now has an issue window of at least three instructions
  I1 requires 2 cycles to execute
  I3 and I4 are in conflict for a functional unit
  I5 depends on the value produced by I4
  I5 and I6 are in conflict for a functional unit

[Figure: cycle-by-cycle table of decode, instruction window, execute and writeback for I1-I6 - I6 is issued from the window ahead of I5, keeping the functional units busy.]

6 instructions require 6 cycles: IPC = 1


Anti-dependency

♣  Out-of-order issue introduces yet another dependency - called an anti-dependency

      R3 := R3 op R5
      R4 := R3 + 1
      R3 := R5 + 1
      R7 := R3 op R4

  the 3rd instruction cannot be completed until the 2nd instruction has read its operands
  otherwise the 3rd instruction may overwrite the operand of the 2nd instruction
  we say that the result of the 3rd instruction has an anti-dependency on the 1st operand of the 2nd instruction
  this is like a true dependency but reversed


Dependencies or storage conflicts

♣  We have now seen three kinds of dependencies:
  true (data) dependencies … read after write (RAW)
  output dependencies … write after write (WAW) - out-of-order completion
  anti-dependencies … write after read (WAR) - out-of-order issue
♣  Only true dependencies reflect the flow of data in a program, and only true dependencies should require the pipeline to stall
  when instructions are issued and completed out of order, the one-to-one relationship between registers and values at any given time is lost
  new dependencies arise because registers hold different values from independent computations at different times - a resource dependency
♣  Anti-dependencies and output dependencies, then, are really just storage conflicts and can be eliminated by introducing new registers to re-establish the one-to-one relationship between registers and values at a given time


Renaming - example code

♣  Renaming dynamically rewrites the assembly code using a larger register set

      R3b := R3 op R5
      R4  := R3b + 1
      R3c := R5 + 1
      R7  := R3c op R4

  R3 -> R3b -> R3c (the scope of R3b ends at the next rename)

  a renamed register is allocated somehow and remains in force until commit
  subsequent use of a register name as an operand uses the latest rename of it
  we may need to keep several renamed registers if using branch prediction
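A minimal Python sketch of this rewriting (the free list and naming scheme are assumptions for illustration): each write to an architectural register allocates a fresh physical name, and each read uses the latest mapping, which removes the WAW and WAR conflicts:

    # Register-renaming sketch: rewrite (dst, src1, src2) triples so that every
    # write allocates a fresh physical register, eliminating WAR/WAW conflicts.
    from itertools import count

    free = (f"p{i}" for i in count())          # unbounded free list of physical regs
    rename = {r: next(free) for r in ("R3", "R4", "R5", "R7")}

    code = [("R3", "R3", "R5"), ("R4", "R3", None),
            ("R3", "R5", None), ("R7", "R3", "R4")]

    for dst, s1, s2 in code:
        operands = [rename[s] for s in (s1, s2) if s]   # read the latest mappings
        rename[dst] = next(free)                        # new name for every write
        print(f"{rename[dst]} := " + " op ".join(operands))
    # The two writes to R3 now target different physical registers.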


Register renaming

♣  Storage conflicts can be removed in out-of-order issue microprocessors by renaming registers
  this requires additional registers, e.g. a rename buffer or an extended register file not identified by the program
  the relationship between program name and actual location has to be maintained while the instructions are executing
♣  Instructions are executed out of sequence from the instruction window using the renamed registers
  allocate a new register on multiple use of the same target register
  store the mapping from instruction register to architectural register
  instructions executing after a rename use the renamed register rather than the instruction-specified register as an operand
♣  A commit stage is used to preserve sequential machine state by storing values to architectural registers in program order


Register renaming

♣  Can rename explicitly at instruction issue
  explicit renaming maps an architectural register to a physical register, used in conjunction with a scoreboard to track dependencies
♣  Can remap implicitly, using reservation stations at the execute stage and a reorder buffer on instruction completion
  reservation stations use dataflow to manage true dependencies and implicitly rename registers
  the reorder buffer holds the data until all previous instructions have completed, then writes it to the architectural register specified


Explicit renaming

♣  Register renaming logic maps an ISA register address to a physical register address
♣  At any point in the fetched code there is just one valid mapping for a given logical register specifier
  subsequent instructions will use that mapping to read their operands until a further remapping is made
♣  Two implementations of the mapping function are possible:
  1.  RAM, where the physical address is stored in a table the size of the ISA register file and read using the ISA register address
  2.  CAM, where the ISA address is stored in a content-addressable table the size of the physical register file - in this case the ISA address is used as a tag, and multiple matches to this tag are resolved using a currently-valid bit

CAM = content-addressable memory


Implementation of renaming

[Figure: the two mapping implementations. RAM: a table the size of the ISA register file, indexed at decode by the ISA target id and returning a physical id. CAM: a table the size of the physical register file, each entry holding an ISA id tag, a physical id and a valid bit, matched at decode; previous mappings are held from branch prediction until confirmation, and a free list supplies physical register ids at commit.]

Dynamic scheduling

Out-of-order issue (OoO)


Abstract OoO instruction issue

[Figure: instruction fetch & decode from the I-cache feeds an instruction window that performs synchronisation and scheduling; slots are marked waiting for data, ready to execute, or empty; parallel instruction issue dispatches ready instructions to the Lw/Sw, Integer, Branch and Float execution units, and writes resolve dependencies, allowing new instructions to be scheduled.]

This is a form of dataflow instruction execution… maintain the state of register synchronisation and schedule instructions (issue them to execution units)


Dynamic techniques for OoO

♣  Out-of-order issue provides exploitation of difficult ILP
  works when dependencies are not known at compile time
  gives binary-code compatibility across generations of an ISA
♣  Scoreboard, originally implemented in the CDC 6600 in 1963
  centralised control structure
  the CDC 6600 had no register renaming and no forwarding
  the pipeline stalls for WAR and WAW hazards
♣  Reservation stations, first used in the IBM 360/91 in 1966
  distributed control structures
  implicit renaming of registers, i.e. dispatch pointers into the RF that are held in reservation stations associated with each of the functional units
  ♥  these tags are matched with result tags using a common data bus
  ♥  results are broadcast to all reservation stations for RAW
  WAR and WAW hazards are eliminated by register renaming


Reservation stations

[Figure: a functional unit fed by reservation-station entries, each holding two operand slots of the form (data/tag, v) and connected to the common data bus. The tag identifies the source of the data, i.e. which functional unit will produce it; v is a valid bit.]

Instructions can be issued before their data is available, with the valid bit set to 0. Only when an instruction has two valid operands can it be sent to the functional unit. Every result is matched against all left/right operand tags in every functional unit, and the data is grabbed if a match occurs.


Scoreboard approach (CDC 6600)

[Figure: the register file feeding the functional units (two FP multipliers, FP divide, FP add and integer), with scoreboard logic tracking functional-unit use & dependencies.]


Four-stage Scoreboard Control

♣  Issue - decode instructions & check for structural hazards (ID1)
  instructions are issued in program order (for hazard checking)
  don’t issue if there is a structural hazard
  don’t issue if the instruction is output dependent on any previously issued but uncompleted instruction (i.e. WAW hazards)
♣  Read operands - wait until there are no data hazards, then read operands (ID2)
  all dependencies (RAW) are resolved in this stage - the scoreboard identifies when a register is valid from prior instructions’ WB
  no forwarding of data (in the CDC 6600)
♣  Execution - operate on operands (EX)
  the functional unit begins execution upon receiving operands and signals the scoreboard on completion
♣  Write result - finish execution (WB)
  stall until there are no WAR hazards on previous instructions

  Example:  DIVD F0,F2,F4
            ADDD F10,F0,F8
            SUBD F8,F8,F14

  The CDC 6600 scoreboard would stall SUBD until ADDD reads its operands
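A hedged Python sketch of the issue-stage checks only (the data structures are assumptions, not the CDC 6600's actual tables): an instruction is refused issue on a structural hazard or a WAW hazard, while WAR hazards are checked later, at write result:

    # Scoreboard issue-stage sketch: refuse issue on structural or WAW hazards.
    fu_busy = {"fp_div": False, "fp_add": False}   # functional-unit status
    pending_writes = set()                         # dest regs of uncompleted instrs

    def try_issue(fu, dst):
        """Issue only if the unit is free and no output (WAW) dependency exists."""
        if fu_busy[fu] or dst in pending_writes:
            return False                           # stall the issue stage
        fu_busy[fu] = True
        pending_writes.add(dst)
        return True

    print(try_issue("fp_div", "F0"))   # True  - DIVD F0,F2,F4 issues
    print(try_issue("fp_add", "F10"))  # True  - ADDD F10,F0,F8 issues
    print(try_issue("fp_add", "F8"))   # False - SUBD stalls: fp_add is busy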

More information

♣  http://www.cs.umd.edu/class/fall2001/cmsc411/projects/dynamic/example1.html



Tomasulo’s algorithm organisation

[Figure: the FP op queue issues to reservation stations (Add1-Add3 feeding the FP adders; Mult1-Mult2 feeding the FP multipliers), alongside load buffers (from memory), store buffers (to memory) and the FP registers; all results are broadcast on the Common Data Bus (CDB).]


Three-stage Tomasulo Algorithm

1.  Issue - get an instruction from the FP op queue
  if a reservation station is free (no structural hazard), control issues the instruction and operands (renaming registers)
2.  Execution - operate on operands (EX)
  when both operands are ready, execute;
  if not ready, watch the Common Data Bus for the result
3.  Write result - finish execution (WB)
  write on the Common Data Bus to all awaiting units and mark the reservation station available

♣  A normal data bus is data + destination
♣  The Common Data Bus is data + source
  64 bits of data + 4 bits of functional-unit address
  a reservation buffer writes if it expects a result from that functional unit
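A minimal Python sketch of the CDB broadcast step (the station layout is an assumption for illustration): every operand slot waiting on the producing unit's tag grabs the value when that unit's result is broadcast:

    # Tomasulo CDB sketch: operand slots hold either a value or a source tag;
    # a broadcast (tag, value) fills every slot waiting on that tag.
    stations = {
        "Add1": {"op": "ADDD", "src1": ("tag", "Mult1"), "src2": ("val", 2.5)},
        "Add2": {"op": "SUBD", "src1": ("val", 1.0), "src2": ("tag", "Mult1")},
    }

    def broadcast(tag, value):
        """Common Data Bus: deliver a result to all awaiting reservation stations."""
        for rs in stations.values():
            for slot in ("src1", "src2"):
                if rs[slot] == ("tag", tag):
                    rs[slot] = ("val", value)

    broadcast("Mult1", 4.0)   # the FP multiplier finishes; both stations grab 4.0
    ready = [name for name, rs in stations.items()
             if rs["src1"][0] == "val" and rs["src2"][0] == "val"]
    print(ready)              # ['Add1', 'Add2'] - both now have two valid operands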


More information

♣  Demo using a web applet, University of Edinburgh
  http://www.dcs.ed.ac.uk/home/hase/webhase/demo/tomasulo.html
♣  Another demo from the University of Massachusetts
  http://www.ecs.umass.edu/ece/koren/architecture/Tomasulo1/tomasulo.htm


Recent example of Tomasulo

♣  The PPC 620 is an implementation of the 64-bit PowerPC architecture
  it can fetch, issue and retire up to 4 instructions per cycle
♣  It has 6 functional units which can issue instructions independently from reservation stations
  2 simple integer units (single cycle)
  1 integer unit for multiply and divide (3-20 cycles latency)
  1 load/store unit (1 cycle integer, 2 cycles floating point)
  1 floating-point unit (2-31 cycles for divide)
  1 branch unit
♣  N.b. the G5 processor is a successor to the PPC 620; see Blackboard for more details


[Figure: PPC 620 block diagram - the fetch unit reads from the instruction cache into a dispatch unit with an 8-entry instruction queue; instruction dispatch buses feed reservation stations in front of the functional units (Int 0, Int 1, Mpy, LSU, FPU, BPU), with integer and floating-point register files and the data cache; a result status bus and reorder buffer information feed the completion unit with its reorder buffer.]


Machine state

♣  Issuing instructions out of order means the sequential-order machine state is lost
  this can cause problems when exceptions occur
  how can we define the machine state if many instructions are in flight, not necessarily in program order?
♣  To checkpoint the sequential state, we must retire or commit instructions only when all previous instructions have finished
  only at this stage can the result of an instruction be written into the architectural register
♣  This is managed in a reorder buffer that provides sequential state out of the end of the pipeline


Reorder buffer

♣  The reorder buffer stores information about all instructions in execution between the issue and retire stages
  it can also store the results of those instructions pending a write to the architectural register file, which is another implicit form of register renaming
♣  The reorder buffer is a queue or FIFO
  instructions are written to it at the tail, in program order, at the issue stage
  instructions are removed from it at the head, but only when they have completed execution
  at this stage the results are written into the user-specified register, and the instruction is then said to be retired or committed
  (a sketch of the commit rule follows below)
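A hedged Python sketch of the commit rule (the entry fields are assumptions): entries leave the head of the FIFO only while their results are valid, so one slow instruction blocks everything behind it:

    # Reorder-buffer sketch: commit from the head only while results are ready.
    from collections import deque

    architectural = {}          # architectural register file
    rob = deque([               # head first, in program order
        {"dst": "R1", "value": 5,    "done": True},
        {"dst": "R2", "value": None, "done": False},   # still executing
        {"dst": "R3", "value": 9,    "done": True},    # finished out of order
    ])

    def commit():
        """Retire completed instructions in program order; stop at the first
        incomplete entry, preserving a sequential machine state."""
        while rob and rob[0]["done"]:
            entry = rob.popleft()
            architectural[entry["dst"]] = entry["value"]

    commit()
    print(architectural)   # {'R1': 5} - R3 is done but cannot commit past R2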


Reorder buffer

[Figure: the reorder buffer as a FIFO from tail to head, each entry holding an instruction and a flag v that records whether its result has been written; entries behind one with v = 0 cannot commit, while a run of v = 1 entries at the head can. Results held in the buffer are available for bypassing or forwarding.]


Variations

♣  There are many variations on the implementation of dynamic OoO scheduling algorithms
  explicit register renaming can be combined with a scoreboard and a standard pipeline (e.g. Alpha 21264)

[Figure: fetch, then decode/rename (backed by a scoreboard + rename table), then execute.]

♣  Issue/completion state can be held in many different places:
  one combined (architectural + rename) register file
  an architectural register file + reservation stations + a reorder buffer
  etc.


Explicit renaming

♣  Requires rapid access to a table of translations
♣  And a physical register file that has more registers than specified by the ISA
♣  And the ability to work out which physical registers are free
  if there are no free registers then stall on issue
♣  Register renaming does not require reservation stations, but…
  many modern architectures use explicit register renaming + Tomasulo-like reservation stations to control execution


Memory state

♣  Note that although a consistent register state may be identified using a reorder buffer, the memory state is a different matter
  this is because of memory delays, cache write-back strategies, etc.
♣  Modern microprocessors hold reads and writes in buffers similar to reservation stations
  these match reads with writes and also bypass data, so that a read to a location whose buffered write has not yet reached memory can obtain its value from the buffer
♣  This allows load/store reordering and can improve the locality of memory accesses


Load/store reordering

♣  A load can safely be allowed to bypass a store as long as the addresses of the load and store are different
♣  If the addresses are not known then either:
  do not allow bypassing
  or speculatively bypass the store, but squash the load if the address turns out to be the same
♣  Loads can also be allowed to bypass loads on cache misses
  this is called a lockup-free cache, but it can complicate the cache coherence protocol


Control hazards revisited

♣  Superscalar pipelines are typically super-pipelined and have many stages, ~10-20
♣  They also have wide issue widths, ~4-8
  we may therefore have 40-160 instructions in flight
♣  The latency to resolve a branch condition is large
  17 cycles for a conditional branch in the Pentium 4!
  hence many pipeline slots will be filled with instructions from the wrong target
  this has to be cleaned up when the wrong target is chosen, including the register renames


Grohoski’s estimate

♣  1 in 5 instructions are branches
  1 in 3 branches are unconditional - fully predictable
  1 in 3 branches are loop-closing conditionals - predictable
  ♥  n−1 of these are taken; only the last is not
  1 in 3 branches are random branches - not so predictable
  ♥  50% of these are taken - on average, if truly random
  therefore 1 in 6 branches (1 in 30 instructions) cause problems, and it is unlikely that the pipe will ever remain full
  ♥  even with 8-way issue, IPC will not get much above 2-3
♣  This requires sophisticated branch prediction algorithms
  (the arithmetic is checked in the sketch below)
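The arithmetic behind the estimate can be checked in a few lines (this simply restates the slide's numbers; it is not new data):

    # Grohoski's estimate: what fraction of instructions are problem branches?
    branch_frac = 1 / 5       # 1 in 5 instructions is a branch
    random_frac = 1 / 3       # 1 in 3 branches is a random branch
    taken_frac = 1 / 2        # ~50% of random branches are taken

    hard_branches = random_frac * taken_frac        # 1 in 6 branches
    hard_instrs = branch_frac * hard_branches       # 1 in 30 instructions
    print(f"1 in {1 / hard_branches:.0f} branches, "
          f"1 in {1 / hard_instrs:.0f} instructions cause problems")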


Dynamic branch prediction

♣  Branch prediction can be used to improve all but the random branches
♣  The prediction is stored in the I-cache, a branch history table (BHT) or a branch target buffer (BTB)
  a 1-bit predictor associated with an instruction address or partial address tells us what the last branch did
  ♥  i.e. taken (T) or not taken (N), which is used to predict the next branch
  a 1-bit-predicted loop will fail on the first and last iterations
  ♥  initially the entry is N; after the first iteration the entry is T and will remain at T till the last iteration, when it will fail again as the branch is not taken
  ♥  NTTTTT…TTTN


2-bit predictors - bimodal counters

♣  Use two bits to represent the last two attempts (~90% accurate) … there are various schemes
  TT, TN, NT and NN are the predictor states
  e.g. change the prediction only after miss-predicting twice, but return in one step - one of several strategies
  (a sketch of this kind of state machine follows below)

[Figure: four-state diagram - TT and TN predict taken, NT and NN predict not taken; T and N outcomes move the predictor between states.]
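A hedged Python sketch of one such scheme - the common 2-bit saturating counter, which changes its prediction only after two consecutive misses (the exact state labelling on the slide may differ):

    # 2-bit saturating-counter branch predictor (one common bimodal scheme).
    # States 0,1 predict not taken; states 2,3 predict taken.
    state = 0

    def predict():
        return state >= 2                   # True = predict taken

    def update(taken):
        """Move one step towards the actual outcome, saturating at 0 and 3."""
        global state
        state = min(state + 1, 3) if taken else max(state - 1, 0)

    # A loop-closing branch: taken n-1 times, then not taken on exit.
    for outcome in [True] * 9 + [False]:
        print("hit" if predict() == outcome else "miss", end=" ")
        update(outcome)
    # Misses only while warming up and on the final, not-taken branch.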


Branch history table / target buffer

♣  Stores the prediction state in a table, either associatively addressed or indexed on a small number of address bits
  it can also store the branch target if it is associative
  get the prediction at the IF stage and update it when the condition is resolved

[Figure: the PC indexes a table of (address, prediction state, target) entries, returning the prediction/target. N.b. in a loop, on every conditional branch, the same address is calculated.]


Correlated or global predictors

♣  There may be correlation between different branches, e.g. nested loops
♣  Normally predictors are indexed on address bits of the branch instruction
♣  Correlation can be tracked by so-called global predictors, which maintain a register of the history of recent branches taken and use this to address the prediction
♣  This scheme by itself is only better than the bimodal scheme for large table sizes, and is never as good as local prediction
♣  Alternatively, the table of bimodal counters can be indexed with the recent history concatenated with a few bits of the branch instruction's address - the Gselect predictor - or the recent history can be XORed with the instruction address


Example microprocessors


Compaq (DEC) Alpha 21264 - EV6

♣  The EV6 was the last Alpha microprocessor to be manufactured
  Alpha has a very clean RISC ISA that uses separate integer and floating-point register files
  Alpha was unique in supporting a high clock rate and a short pipeline through good ISA and silicon design
♣  The EV6 uses out-of-order issue in a 7-stage pipeline
  4 instructions per cycle can be fetched (speculatively) and up to 6 instructions issued out of order
  sophisticated branch predictor
  uses scoreboarding and explicit renaming techniques to track dependencies and avoid false dependencies


Alpha 21264 pipeline

[Figure: the Alpha 21264 pipeline.]

Saving ports of the RF

♣  Later in this section we will see that register file area grows quadratically with the number of ports
  the number of ports is proportional to the number of functional units allowed to access the file in one cycle
  multiple-issue processors require two read ports and one write port per concurrent functional unit
♣  Ports can be minimised by:
  separating floating-point and integer operations, as long as data can be moved between the two
  duplicating registers
♣  The Alpha uses both techniques: 72 floating-point registers, and 80 integer registers which are duplicated to minimise ports


[Figure: the Alpha groups two functional units with each of two copies of the 80-entry integer register file, instead of sharing one register file among all four units.]

Each copy needs 4 read and 4 write ports, so the area is 2 × c × 8² = 128c, instead of having one file with 8 read and 4 write ports at area c × 12² = 144c (c is a constant based on line width/spacing). This is only a small saving in area, but it also aids the locality of signals on the critical read-to-functional-unit path.


Stage 0 - branch prediction

♣  Hybrid "tournament" branch predictor
♣  90-100% accuracy

[Figure: the PC indexes a local history table (1024 × 10 bits) whose output selects a local prediction (1024 × 3 bits); the path history indexes a global prediction table (1024 × 2 bits); a choice prediction table (4096 × 2 bits) selects between the local and global predictions - 23 Kbits in total.]


Instruction fetching

♣  The I-cache is a 2-way set-associative cache
  a cache block fetch is four instructions
  it uses line and set prediction
  ♥  it predicts where to fetch the next block from
  ♥  and in which set to place the block
  accuracy of 85%
  a line miss-prediction typically costs a 1-cycle bubble


AMD Athlon

♣  An implementation of the x86 instruction set
♣  3 instructions fetched speculatively
  uses a branch history table and branch target buffer
♣  Uses a reorder buffer for renaming, with integer operations merged with the architectural register file
♣  Long floating-point operations are renamed and scheduled independently
  the floating-point pipes also implement the MMX extensions
♣  Uses an 8-way banked L1 D-cache
  both I- and D-cache are 2-way set associative


AMD Athlon

[Figure: AMD Athlon block diagram.]

AMD 64-bit architecture

♣  The Opteron, introduced in 2003, moved AMD to a 64-bit architecture
♣  In 2005 AMD introduced the first dual-core processor
♣  In 2007 AMD introduced the quad-core, and now offers a 6-core version


6-core Opteron

[Die photo: 904M transistors, 346 mm², 45 nm technology.]


PowerPC G5 (IBM/Apple)

♣  The PowerPC G5 was fabricated using 0.130 µm (130 nm) circuitry … about 350 metres of wiring!
♣  In-order dispatch of up to five operations into a distributed issue queue structure - reservation stations
♣  Simultaneous issue of up to 10 out-of-order operations to 12 functional units
  N.b. in-order dispatch / out-of-order issue!


G5 Specifications

♣  Velocity Engine - 128 bits wide
♣  Other datapaths - 64 bits
♣  64 KB L1 I-cache
♣  32 KB L1 D-cache
♣  512 KB L2 cache
♣  Uses a hybrid branch predictor like the Alpha
♣  Support for up to 8 outstanding L1 cache misses
♣  Hardware-initiated instruction prefetching from the L2 cache
♣  Hardware- or software-initiated data stream prefetching - we look at this later
  support for up to eight active streams


Intel Pentium 4

♣  This has a very deep pipeline and hence a high clock rate - 4 GHz compared to the G5’s 2 GHz
♣  Problems are exacerbated by the x86 CISC instruction set, which is not suitable for pipelining
♣  x86 instructions are translated into µops
  these are regular and uniform - like a traditional RISC ISA
  a trace cache caches translated µop sequences along program execution paths


Pentium 4 pipeline overview

♣  The decode stage translates x86 instructions into µops
♣  The trace cache stores µop traces
  i.e. a cache of instructions as executed, rather than as stored in memory
♣  Branch history table, branch target buffer + static prediction
♣  Now adopts simultaneous multi-threading - hyperthreading
  fetches instructions simultaneously from two threads


Trace cache

♣  Caches branch sequences rather than memory sequences
  i.e. the sequence of instructions as executed, rather than their order in memory


Pentium 4 pipeline

♣  6-way out-of-order execution, 20-stage pipeline
  2 generations compared below (note the superpipelining)
♣  126-entry reorder buffer (registers and result status)


Multi-core

♣  The Pentium 4D increased the pipeline length to 31 stages in an attempt to push clock speeds to 4 GHz
♣  Since then Intel has moved to multi-core - shorter pipelines, slower clocks
  2006 Core 2 Duo, 2.93 GHz, 291M transistors, 65 nm
  2007 quad-core Xeon, 2.66 GHz, 582M transistors, 65 nm
  2007 quad-core Xeon Penryn, >3 GHz, 820M transistors, 45 nm
  2008 quad-core i7 Nehalem, 731M transistors, 45 nm


Penryn

♣  2 or 4 cores
♣  14-stage pipeline
♣  4-way instruction issue
♣  No hyperthreading
♣  Large 12 MB L2 cache


Intel Core i7 - Nehalem

♣  2 or 4 cores, up to 3 GHz
♣  14-stage pipeline with stream prefetching
♣  Return of hyper-threading
♣  32 KB L1 I-cache & 32 KB L1 D-cache (8-way set associative)
♣  256 KB L2 cache per core (8-way set associative)
♣  Shared 8 MB L3 cache (16-way associative)


Nehalem

[Figure: Nehalem processor diagram.]

Evolution of Intel processors

Download from: http://download.intel.com/pressroom/kits/IntelProcessorHistory.pdf


Architectural Optimisations


Application-specific optimisations

♣  Multimedia applications require intensive compute power
  they are also very regular in their structure, i.e. they perform the same operations on large, regular data sets
♣  A number of optimisations can be made to accelerate these applications - we consider two:
  SIMD extensions
  streaming data


SIMD extensions

♣  Many modern microprocessors provide SIMD extensions for multimedia processing
  e.g. MMX, AltiVec, Velocity Engine
♣  Multimedia applications have specific characteristics:
  low bit precision (often 8 or 16 bits)
  regular concurrent operations (vector operations)
  MACs - no, not burgers, but multiply-accumulates
  ♥  these result from inner-product operations in linear algebra


Example of SIMD extensions

♣  Example - PPC G5 processor
  AltiVec - the Velocity Engine processes up to 128 bits of data as:
  ♥  4 × 32-bit integers,
  ♥  8 × 16-bit integers,
  ♥  16 × 8-bit integers, or
  ♥  4 × 32-bit single-precision floating-point values
  pipelined single-cycle operations


SIMD instructions

[Figure: element-wise SIMD operations on packed registers.]

We also need to be able to shift data, to align it for processing where indices are offset.


Register set

♣  Like the separation of integer and floating point in a general-purpose processor, SIMD extensions may use existing or dedicated registers
  the AltiVec 128-bit SIMD extensions use separate registers
♣  They usually provide both modulo and saturated arithmetic
  modulo arithmetic wraps around in the field specified
  saturated arithmetic, as the name suggests, is limited at either end of the range
  (see the sketch below)
♣  The extensions of course need compiler support
  support varies from automatic vectorisation…
  …to support for vector types and operations
♣  the many different instruction sets do not help portability in the latter case
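A small Python sketch of the difference for 8-bit unsigned elements (the field width is an assumed example):

    # Modulo vs saturated addition on 8-bit unsigned values.
    def add_modulo(a, b):
        return (a + b) & 0xFF          # wraps around in the 8-bit field

    def add_saturated(a, b):
        return min(a + b, 0xFF)        # limited at the top of the range

    print(add_modulo(250, 10))         # 4   - wrapped around
    print(add_saturated(250, 10))      # 255 - saturated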


SIMD computers

♣  Historically, many pure SIMD computers have been built:
  the Illiac IV was an array of 64 64-bit processors, circa 1975
  the ICL DAP was a 4096-element array of 1-bit processors, circa 1980
♣  The technique of reconfiguration is also not new:
  the ICL DAP could be configured between 4096 1-bit processors and 32 32-bit processors
  the RPA (1984-89) was more flexible:
  ♥  Jesshope C R et al. (1989) Design of SIMD microprocessor array, Computers and Digital Techniques, IEE Proceedings E, 136(3), pp. 197-204, ISSN 0143-7062
  ♥  Rushton R (1989) Reconfigurable Processor-Array: A Bit-Sliced Parallel Computer, MIT Press (Ed. Jesshope C R), ISBN 0262680572


Streaming applications

♣  Multimedia (and other streaming applications) suffer badly from compulsory cache misses
  computation uses a stream of data that is used once, transformed and never used again
  there is very little opportunity for reuse
♣  The data stream is very regular, however
  typically data addresses are of the form:

  S_i = A_0 + i × stride

  where stride defines the offset between subsequent locations
♣  A solution is to bring data into the cache before it is required
  this is called stream prefetching


Stream prefetching

♣  The process requires two actions:
  1.  detecting the stream - i.e. detecting regular strides between memory accesses
  2.  issuing prefetch instructions to bring data into the cache
♣  The options are:
  static detection / static prefetch - SW (user or compiler generated)
  static detection / dynamic prefetch - HW/SW
  dynamic detection / dynamic prefetch - HW solution


Static detection, static prefetch

♣  Cheap and cheerful
  requires an ISA that supports prefetch
  ♥  a prefetch instruction touches the cache with the required address before the data is needed, but does not load a register
  disadvantages:
  ♥  requires additional instructions to be executed - critical in real-time signal processing
  ♥  the frequency of access is based on cache/memory parameters and requires loop unrolling - this can interfere with other optimisations
  ♥  the prefetch is not binding, i.e. the data can be swapped out again by another miss into the cache - depends on the replacement policy


Dynamic detection, dynamic issue

♣  Requires hardware to detect strides between consecutive loads for a given instruction
  many streams may be interleaved
  ♥  a stride-prediction table (SPT) keeps a history of instruction address and memory address and detects stream candidates (see the sketch below)
♣  Once candidates are identified, the cache can be touched without using an instruction slot in the pipeline
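A hedged Python sketch of the stride-detection idea (the table layout is an assumption, not a real SPT design): for each load instruction, compare the current address with the last one; once the same stride is seen twice, the entry becomes a stream candidate and a prefetch address can be produced:

    # Stride-prediction table sketch: keyed by the load instruction's address,
    # tracking the last data address and the last observed stride.
    spt = {}

    def observe(pc, addr):
        """Record a load; return an address to prefetch once a stride repeats."""
        entry = spt.setdefault(pc, {"last": addr, "stride": None, "stable": False})
        if entry["last"] != addr:
            stride = addr - entry["last"]
            entry["stable"] = (stride == entry["stride"])   # same stride twice?
            entry["stride"] = stride
            entry["last"] = addr
        return addr + entry["stride"] if entry["stable"] else None

    for a in (1000, 1008, 1016, 1024):       # a load streaming with stride 8
        print(observe(pc=0x40, addr=a))      # None, None, 1024, 1032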


Dynamic solution

♣  Advantages:
  no programmer intervention - supports legacy code
♣  Disadvantages:
  a large SPT is required
  an even larger SPT if we perform loop unrolling
  there is still a potential for cache thrashing
♣  To avoid thrashing, separate stream caches can be used
  the replacement policy for a stream cache differs from that for normal locality
  ♥  e.g. evict the most recently used line rather than the least recently used
  can also look at longer history sequences


Static detection, dynamic issue

♣  Static detection of streams is trivial, as the stream is implicit in the loop parameters
♣  Again, either the compiler or the programmer signals which streams to prefetch
  instead of using prefetch instructions, instructions set up a prefetch engine to prefetch the stream
  requires only a small prefetch table
  no additional instructions are required in the inner loop

Technology issues


Cancelled projects

♣  Both Intel and Compaq have recently cancelled wide superscalar projects
  Pentium NetBurst and the Alpha 21464
♣  The trend is now to develop independent microprocessors on chip - multi-cores
  this is covered in the next section on explicit concurrency
♣  But why is this?


Technology issues

♣  Moore’s law - applied to CMOS silicon
  gate concurrency scales as e^(kt) (0.35 ≤ k ≤ 0.46)
  ♥  t in years, i.e. a doubling period of 1.5 to 2 years
  gate speed scales as e^(kt/2)
  most performance has come from gate speed, with density providing support in the form of large caches
♣  Industry predicts 10 to 15 years of scaling left from the SIA roadmap
  pessimistically, 32-fold concurrency and 6-fold speed
  optimistically, 1024-fold concurrency and 32-fold speed


Power dissipated

♣  The power dissipated in a silicon CMOS circuit comprises several components; the major component has been dynamic power
♣  Dynamic power = kV²⟨f⟩

Ronen et al. (2001) Coming Challenges in Microarchitecture and Architecture, Proc. IEEE, 89(3), pp. 325-340


Architecture

♣  Past performance gains came mostly from clock rate
  the clock improved faster than circuit speed - super-pipelining
♣  Gate concurrency has not translated into instruction issue width - the example below is quite revealing
  1987 - T800 transputer, 0.5M transistors
  ♥  20 MHz and single issue, 64-bit floating point
  2004 - P4 Extreme, 178M transistors
  ♥  3 GHz and 6-way issue

  Scaling                      Concurrency   Speed
  Predicted from Moore’s law   387           20
  Transistor scaling           356           ?
  Instruction issue scaling    6             150

From the known scaling, instead of one P4 we could have had 387 400 MHz T800 transputers. Speed is 7-8 times that expected - superpipelining! Issue width is 64 times less than expected!


Voltage - frequency scaling

♣  Within some voltage range:
  frequency increases with supply voltage, as k1·V^(µ−1), µ ≥ 1
  but power increases more quickly, as k2·V^(µ+2)
♣  This can be used to trade performance against power by reducing the voltage
  if frequency is reduced, the voltage can be reduced and hence the power dissipated is reduced
♣  With scalable concurrency, concurrent execution can reduce the total energy of a computation by lowering the required frequency
  (a worked sketch follows below)
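A hedged numerical sketch of that trade-off (the exponent µ and the halving of frequency are assumed example values): two cores at half the frequency, and therefore at a lower voltage, deliver the same throughput for less dynamic power:

    # Voltage-frequency scaling sketch: f ~ k1*V^(mu-1) and P ~ k*V^2*f.
    mu = 2.0                                  # assumed example exponent, mu > 1

    def voltage_for(f):
        """Invert f = k1*V^(mu-1), normalised so f = 1.0 at V = 1.0."""
        return f ** (1.0 / (mu - 1.0))

    def dynamic_power(f):
        return voltage_for(f) ** 2 * f        # P = k*V^2*<f> with k = 1

    p_one_fast = dynamic_power(1.0)           # one core at full frequency
    p_two_slow = 2 * dynamic_power(0.5)       # two cores at half frequency
    print(f"two slow cores use {p_two_slow / p_one_fast:.0%} of the power")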


Signal propagation

[Figure: the fraction of a 20 mm chip edge reachable in one clock cycle shrinks as technology scales through 130 nm, 100 nm, 70 nm and 35 nm, shown analytically and qualitatively, with the clock cycle defined by 8, 16 or the optimal (f_SIA) number of fan-out-of-4 gate delays.]


Superscalar support structures

♣  Superscalar register access is not scalable
  for n instructions issued per cycle, the number of registers grows as O(n) - otherwise we run out of registers
  for n read/write ports in a register file, a one-bit register cell's area grows as O(n²)
  register file area and latency therefore grow as O(n³)
♣  Superscalar issue is not scalable
  it requires logic which grows at least as O(n²) to issue n instructions per cycle
  multi-ported rename logic must be able to rename the same register multiple times in one cycle
  rename logic is one of the key complexities in multi-issue!
♣  The power dissipated in these structures does not scale well


More architecture

♣  The high-profile project cancellations were due to power density and circuit complexity issues arising from the poor scalability of dynamic concurrency - out-of-order instruction issue
  NetBurst, Pentium 4 - Xeon evolution: “The chips are apparently too hot to go to market”
  the Alpha 21464, an 8-way issue SMT superscalar
♣  This should not have been a surprise!
  it was predicted in 1999 that the Alpha 21464’s instruction issue would consume 25% of the chip's power budget


Alpha 21464 project

♣  A quick look at the cancelled Alpha 21464
  R P Preston et al. (2002) Design of an 8-wide superscalar RISC microprocessor with simultaneous multithreading, ISSCC Digest and Visuals Supplement
♣  This was a 64-bit microprocessor designed to support 4 threads and to issue/retire 8 instructions out of order to 12 functional units in every cycle
♣  The floor plan on the next slide shows the large area dedicated to instruction issue and the register file

[Figure: Alpha 21464 floor plan.]


Summary

♣  Out-of-order issue exploits instruction-level concurrency from a sequential instruction stream (implicit concurrency)
  it supports a large number of instructions in execution simultaneously in multiple functional units
  instructions are dynamically scheduled at issue, executing out of order while honouring dependencies
  dependencies introduced by completing and issuing instructions out of order are removed by register renaming
♣  Out-of-order issue is costly, only appropriate for low levels of concurrency, and inefficient on irregularly branching code
  it is difficult to get an IPC of much more than 2, even for 8-way issue
♣  The clear advantage is concurrency with binary-code compatibility

These issues should be considered when programming for efficiency!


History of High Performance Computers

[Figure: supercomputer performance in FLOPS (1M to 1P) and CPU frequencies (10 MHz to 10 GHz) plotted against year, 1980-2010, tracking machines from the CRAY X-MP, VP-200 and S-810/20 through the CM-5, Paragon, T3D/T3E, ASCI Red, White and Q, to the Earth Simulator. The gap between single-CPU performance - delivered by pipelining & instruction parallelism following Moore's law - and total system performance is filled by system parallelism. Data courtesy of Jack Dongarra.]