CS 152 Computer Architecture and Engineering Lecture 13 - Out-of-Order Issue, Register Renaming, &...

CS 152 Computer Architecture and

Engineering

Lecture 13 - Out-of-Order Issue,

Register Renaming,

& Branch Prediction

Krste AsanovicElectrical Engineering and Computer Sciences

University of California at Berkeley

http://www.eecs.berkeley.edu/~krstehttp://inst.eecs.berkeley.edu/~cs152

3/12/2009 CS152-Spring’09 2

Last time in Lecture 12• Pipelining is complicated by multiple and/or variable

latency functional units

• Out-of-order and/or pipelined execution requires tracking of dependencies

– RAW

– WAR

– WAW

• Dynamic issue logic can support out-of-order execution to improve performance

– Last time, looked at simple scoreboard to track out-of-order completion

• Hardware register renaming can further improve performance by removing hazards.

3/12/2009 CS152-Spring’09 3

Out-of-Order Issue

• Issue stage buffer holds multiple instructions waiting to issue.

• Decode adds next instruction to buffer if there is space and the instruction does not cause a WAR or WAW hazard.

– Note: WAR possible again because issue is out-of-order (WAR not possible with in-order issue and latching of input operands at functional unit)

• Any instruction in buffer whose RAW hazards are satisfied can be issued (for now at most one dispatch per cycle). On a write back (WB), new instructions may get enabled.

IF ID WB

ALU Mem

3/12/2009 CS152-Spring’09 4

Overcoming the Lack of Register Names

Floating Point pipelines often cannot be kept filled with small number of registers.

IBM 360 had only 4 floating-point registers

Can a microarchitecture use more registers than specified by the ISA without loss of ISA compatibility ?

Robert Tomasulo of IBM suggested an ingenious solution in 1967 using on-the-fly register renaming

3/12/2009 CS152-Spring’09 5

Instruction-level Parallelism via Renaming

latency1 LD F2, 34(R2) 1

2 LD F4, 45(R3) long

3 MULTD F6, F4, F2 3

4 SUBD F8, F2, F2 1

5 DIVD F4’, F2, F8 4

6 ADDD F10, F6, F4’ 1

In-order: 1 (2,1) . . . . . . 2 3 4 4 3 5 . . . 5 6 6Out-of-order: 1 (2,1) 4 4 5 . . . 2 (3,5) 3 6 6

Any antidependence can be eliminated by renaming. (renaming additional storage) Can it be done in hardware? yes!

3/12/2009 CS152-Spring’09 6

Register Renaming

• Decode does register renaming and adds instructions to the issue stage reorder buffer (ROB)

renaming makes WAR or WAW hazards impossible

• Any instruction in ROB whose RAW hazards have been satisfied can be dispatched.

Out-of-order or dataflow execution

IF ID WB

ALU Mem

3/12/2009 CS152-Spring’09 7

Dataflow execution

Instruction slot is candidate for execution when:• It holds a valid instruction (“use” bit is set)• It has not already started execution (“exec” bit is clear)• Both operands are available (p1 and p2 are set)

Reorder buffer

ptr2 next to

deallocate

nextavailable

Ins# use exec op p1 src1 p2 src2

3/12/2009 CS152-Spring’09 8

Renaming & Out-of-order IssueAn example

• When are tags in sources replaced by data?

• When can a name be reused?

1 LD F2, 34(R2)2 LD F4, 45(R3)3 MULTD F6, F4, F24 SUBD F8, F2, F25 DIVD F4, F2, F86 ADDD F10, F6, F4

Renaming table Reorder bufferIns# use exec op p1 src1 p2 src2

data / ti

p dataF1F2F3F4F5F6F7F8

Whenever an FU produces data

Whenever an instruction completes

1 1 0 LD

2 1 0 LD

5 1 0 DIV 1 v1 0 t4 4 1 0 SUB 1 v1 1 v1

3 1 0 MUL 0 t2 1 v1

1 1 1 LD 0

4 1 1 SUB 1 v1 1 v1 4 0

5 1 0 DIV 1 v1 1 v4

2 1 1 LD 2 0

3 1 0 MUL 1 v2 1 v1

3/12/2009 CS152-Spring’09 9

Data-Driven ExecutionRenaming table &reg file

Reorder buffer

Load Unit

FU FU Store Unit

< t, result >

Ins# use exec op p1 src1 p2 src2 t1

• Instruction template (i.e., tag t) is allocated by the Decode stage, which also associates tag with register in regfile• When an instruction completes, its tag is deallocated

Replacing the tag by its valueis an expensive operation

3/12/2009 CS152-Spring’09 10

Simplifying Allocation/Deallocation

Instruction buffer is managed circularly•“exec” bit is set when instruction begins execution •When an instruction completes its “use” bit is marked free• ptr2 is incremented only if the “use” bit is marked free

Reorder buffer

ptr2 next to

deallocate

nextavailable

Ins# use exec op p1 src1 p2 src2

3/12/2009 CS152-Spring’09 11

IBM 360/91 Floating-Point UnitR. M. Tomasulo, 1967

123456

loadbuffers(from memory)

FloatingPointReg

store buffers(to memory)

instructions

Common bus ensures that data is made available immediately to all the instructions waiting for it.Match tag, if equal, copy value & set presence “p”.

distribute instruction templatesby functionalunits

< tag, result >

p tag/datap tag/datap tag/data

p tag/datap tag/data

p tag/datap tag/data2

p tag/datap tag/datap tag/data

p tag/datap tag/datap tag/datap tag/data

3/12/2009 CS152-Spring’09 12

Effectiveness?

Renaming and Out-of-order execution was firstimplemented in 1969 in IBM 360/91 but did not show up in the subsequent models until mid-Nineties.

Why ?Reasons

1. Effective on a very small class of programs2. Memory latency a much bigger problem3. Exceptions not precise!

One more problem needed to be solved

Control transfers

3/12/2009 CS152-Spring’09 13

Precise Interrupts

It must appear as if an interrupt is taken between two instructions (say Ii and Ii+1)

• the effect of all instructions up to and including Ii is totally complete• no effect of any instruction after Ii has taken place

The interrupt handler either aborts the program or restarts it at Ii+1 .

3/12/2009 CS152-Spring’09 14

Effect on InterruptsOut-of-order Completion

I1 DIVD f6, f6, f4I2 LD f2, 45(r3)I3 MULTD f0, f2, f4I4 DIVD f8, f6, f2I5 SUBD f10, f0, f6I6 ADDD f6, f8, f2

out-of-order comp 1 2 2 3 1 4 3 5 5 4 6 6 restore f2 restore f10

Consider interrupts

Precise interrupts are difficult to implement at high speed- want to start execution of later instructions before exception checks finished on earlier instructions

3/12/2009 CS152-Spring’09 15

Exception Handling(In-Order Five-Stage Pipeline)

• Hold exception flags in pipeline until commit point (M stage)• Exceptions in earlier pipe stages override later exceptions• Inject external interrupts at commit point (override others)• If exception at commit: update Cause and EPC registers, kill all stages, inject handler PC into fetch stage

Asynchronous Interrupts

PCInst. Mem D Decode E M

Data Mem W+

Kill D Stage

Kill F Stage

Kill E Stage

Illegal Opcode Overflow

Data Addr Except

PC Address Exceptions

Kill Writeback

Select Handler

Commit Point

3/12/2009 CS152-Spring’09 16

Fetch: Instruction bits retrieved from cache.

Phases of Instruction Execution

I-cache

Fetch Buffer

IssueBuffer

Func.Units

Arch.State

Execute: Instructions and operands sent to execution units . When execution completes, all results and exception flags are available.

Decode: Instructions placed in appropriate issue (aka “dispatch”) stage buffer

ResultBuffer Commit: Instruction irrevocably updates

architectural state (aka “graduation” or “completion”).

3/12/2009 CS152-Spring’09 17

In-Order Commit for Precise Exceptions

• Instructions fetched and decoded into instruction reorder buffer in-order• Execution is out-of-order ( out-of-order completion)• Commit (write-back to architectural state, i.e., regfile & memory, is in-order

Temporary storage needed to hold results before commit (shadow registers and store buffers)

Fetch Decode

Execute

CommitReorder Buffer

In-order In-orderOut-of-order

KillKill Kill

Exception?Inject handler PC

3/12/2009 CS152-Spring’09 18

Extensions for Precise Exceptions

Reorder buffer

next tocommit

nextavailable

• add <pd, dest, data, cause> fields in the instruction template• commit instructions to reg file and memory in program order buffers can be maintained circularly

• on exception, clear reorder buffer by resetting ptr1=ptr2(stores must wait for commit before updating memory)

Inst# use exec op p1 src1 p2 src2 pd dest data cause

3/12/2009 CS152-Spring’09 19

Rollback and Renaming

Register file does not contain renaming tags any more.How does the decode stage find the tag of a source register?

Search the “dest” field in the reorder buffer

Register File(now holds only committed state)

Reorderbuffer

Load Unit

FU FU FU Store Unit

< t, result >

Ins# use exec op p1 src1 p2 src2 pd dest data

Commit

3/12/2009 CS152-Spring’09 20

Renaming Table

Register File

Reorder buffer

Load Unit

FU FU FU Store Unit

< t, result >

Ins# use exec op p1 src1 p2 src2 pd dest data

Commit

Rename Table

Renaming table is a cache to speed up register name look up. It needs to be cleared after each exception taken. When else are valid bits cleared? Control transfers

r1 t vr2

tagvalid bit

3/12/2009 CS152-Spring’09 21

CS152 Administrivia

3/12/2009 CS152-Spring’09 22

I-cache

Fetch Buffer

IssueBuffer

Func.Units

Arch.State

Execute

Decode

ResultBuffer Commit

Branchexecuted

Next fetch started

Modern processors may have > 10 pipeline stages between next PC calculation and branch resolution !

Control Flow Penalty

How much work is lost if pipeline doesn’t follow correct instruction flow?

~ Loop length x pipeline width

3/12/2009 CS152-Spring’09 23

Instruction Taken known? Target known?

BEQZ/BNEZ

MIPS Branches and Jumps

Each instruction fetch depends on one or two pieces of information from the preceding instruction:

1) Is the preceding instruction a taken branch?

2) If so, what is the target address?

After Inst. Decode

After Inst. Decode After Inst. Decode

After Inst. Decode After Reg. Fetch

After Reg. Fetch*

*Assuming zero detect on register read

3/12/2009 CS152-Spring’09 24

Branch Penalties in Modern Pipelines

A PC Generation/MuxP Instruction Fetch Stage 1F Instruction Fetch Stage 2B Branch Address Calc/Begin DecodeI Complete DecodeJ Steer Instructions to Functional unitsR Register File ReadE Integer Execute

Remainder of execute pipeline (+ another 6 stages)

UltraSPARC-III instruction fetch pipeline stages(in-order issue, 4-way superscalar, 750MHz, 2000)

Branch Target Address Known

Branch Direction &Jump Register Target Known

3/12/2009 CS152-Spring’09 25

Reducing Control Flow Penalty

Software solutions• Eliminate branches - loop unrolling

Increases the run length • Reduce resolution time - instruction scheduling

Compute the branch condition as early as possible (of limited value)

Hardware solutions• Find something else to do - delay slots

Replaces pipeline bubbles with useful work

(requires software cooperation)• Speculate - branch prediction

Speculative execution of instructions beyond the branch

3/12/2009 CS152-Spring’09 26

Branch Prediction

Motivation:Branch penalties limit performance of deeply pipelined processors

Modern branch predictors have high accuracy(>95%) and can reduce branch penalties significantly

Required hardware support:Prediction structures:

• Branch history tables, branch target buffers, etc.

Mispredict recovery mechanisms:• Keep result computation separate from commit• Kill instructions following branch in pipeline• Restore state to state following branch

3/12/2009 CS152-Spring’09 27

Static Branch Prediction

Overall probability a branch is taken is ~60-70% but:

ISA can attach preferred direction semantics to branches, e.g., Motorola MC88110

bne0 (preferred taken) beq0 (not taken)

ISA can allow arbitrary choice of statically predicted direction, e.g., HP PA-RISC, Intel IA-64 typically reported as ~80% accurate

JZbackward

90%forward

3/12/2009 CS152-Spring’09 28

Dynamic Branch Predictionlearning based on past behavior

Temporal correlationThe way a branch resolves may be a good predictor of the way it will resolve at the next execution

Spatial correlation Several branches may resolve in a highly correlated manner (a preferred path of execution)

3/12/2009 CS152-Spring’09 29

• Assume 2 BP bits per instruction• Change the prediction after two consecutive mistakes!

¬takewrong

taken¬ taken

taken¬takeright

takeright

takewrong

¬ taken

¬ taken¬ taken

BP state:(predict take/¬take) x (last prediction right/wrong)

Branch Prediction Bits

3/12/2009 CS152-Spring’09 30

Branch History Table

4K-entry BHT, 2 bits/entry, ~80-90% correct predictions

0 0Fetch PC

Branch? Target PC

I-Cache

Opcode offset

Instruction

BHT Index

2k-entryBHT,2 bits/entry

Taken/¬Taken?

3/12/2009 CS152-Spring’09 31

Acknowledgements

• These slides contain material developed and copyright by:

– Arvind (MIT)

– Krste Asanovic (MIT/UCB)

– Joel Emer (Intel/MIT)

– James Hoe (CMU)

– John Kubiatowicz (UCB)

– David Patterson (UCB)

• MIT material derived from course 6.823

• UCB material derived from course CS252

CS 152 Computer Architecture and Engineering Lecture 13 - Out-of-Order Issue, Register Renaming, &...

Documents

Transcript of CS 152 Computer Architecture and Engineering Lecture 13 - Out-of-Order Issue, Register Renaming, &...

Asanovic/Devadas Spring 2002 6.823 Complex Instruction Set Evolution in the Sixties: Stack and GPR Architectures Krste Asanovic Laboratory for Computer.

© Krste Asanovic, 2014CS252, Spring 2014, Lecture 10 CS252 Graduate Computer Architecture Spring 2014 Lecture 10: Memory Krste Asanovic krste@eecs.berkeley.edu.

Asanovic/Devadas Spring 2002 6.823 Simple Instruction Pipelining Krste Asanovic Laboratory for Computer Science Massachusetts Institute of Technology.

RAMP Gold: ParLab InfiniCore Model Krste Asanovic UC Berkeley RAMP Retreat, January 16, 2008.

1 RAMP Infrastructure Krste Asanovic UC Berkeley RAMP Tutorial, ISCA/FCRC, San Diego June 10, 2007.

Asanovic/Devadas Spring 2002 6.823 Advanced CISC Implementations: Pentium 4 Krste Asanovic Laboratory for Computer Science Massachusetts Institute of Technology.

Asanovic/Devadas Spring 2002 6.823 VLIW/EPIC: Statically Scheduled ILP Krste Asanovic Laboratory for Computer Science Massachusetts Institute of Technology.

6.893: Advanced VLSI Computer Architecture, September 7, 2000, Lecture 1, Slide 1. © Krste Asanovic Krste Asanovic krste@lcs.mit.edu

RAMP Gold Update Zhangxi Tan, Krste Asanovic, David Patterson UC Berkeley August, 2008.

© Krste Asanovic, 2014CS252, Spring 2014, Lecture 17 CS252 Graduate Computer Architecture Spring 2014 Lecture 17: I/O Krste Asanovic krste@eecs.berkeley.edu.

Virtual Machines and Dynamic Translation: …...Asanovic/Devadas Spring 2002 6.823 Virtual Machines and Dynamic Translation: Implementing ISAs in Software Krste Asanovic Laboratory

Krste Asanovic Laboratory for Computer Science ... · Asanovic/Devadas Spring 2002 Instruction Set Architecture 6.823 (ISA) versus Implementation ISA is the hardware/software interface

© Krste Asanovic, 2015CS252, Spring 2015, Lecture 13 CS252 Graduate Computer Architecture Fall 2015 Lecture 13: Multithreading Krste Asanovic krste@eecs.berkeley.edu.

© Krste Asanovic, 2015CS252, Fall 2015, Lecture 3 CS252 Graduate Computer Architecture Fall 2015 Lecture 3: CISC versus RISC Krste Asanovic krste@eecs.berkeley.edu.

Clock and Power 6.375 Complex Digital Systems Krste Asanovic March 7, 2007.

Final Exam May 8th, 2018 Professor Krste Asanovic Name:inst.eecs.berkeley.edu/~cs152/sp18/handouts/sp18-final-sol.pdfProfessor Krste Asanovic Name:_____ This is a closed book, closed

© Krste Asanovic, 2014CS252, Spring 2014, Lecture 5 CS252 Graduate Computer Architecture Spring 2014 Lecture 5: Out-of-Order Processing Krste Asanovic.

Chisel Tutorial - University of California, BerkeleyChisel Tutorial Jonathan Bachrach, Krste Asanovic, John Wawrzynek´ EECS Department, UC Berkeley fjrb|krste|johnwg@eecs.berkeley.edu

Asanovic/Devadas Spring 2002 6.823 6.823 Computer System Architecture Lecturers: Krste Asanovic, Srini Devadas.

Asanovic/Devadas Spring 2002 6.823 Influence of Technology and Software on Instruction Sets: Up to the dawn of IBM 360 Krste Asanovic Laboratory for Computer.