ECE 486/586
Computer Architecture
Lecture # 17
Spring 2019
Portland State University
Lecture Topics
• Branch Prediction
– Tournament Predictors
– Branch Target Buffer (BTB)
– Return Address Stack (RAS)
• Speculative Execution
Reference:
• Chapter 3: Sections 3.3, 3.6, 3.9
Tournament Predictors: Adaptively Combining Local and Global Predictors
• Some branches are predicted more accurately with global predictors
• Some branches are predicted better with local predictors
• Key Idea: Combine both local and global predictors and dynamically select the right predictor for the right branch
• The selector is yet another 2-bit predictor (a saturating counter with its own state machine)
• It chooses based on which predictor (local, global, or even some mix) was most accurate in recent predictions
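The selection scheme above can be sketched in Python. This is a minimal illustration under stated assumptions, not the lecture's design: the table sizes, indexing the selector by the branch PC, and training the selector only when the two component predictors disagree are all choices made for this sketch.

```python
# Minimal tournament (local + global) branch predictor sketch.
# Assumed (hypothetical) parameters: 10-bit histories, 1024-entry tables,
# word-aligned PCs, tables indexed by low PC bits.

HIST_BITS = 10
TABLE_SIZE = 1 << HIST_BITS
MASK = TABLE_SIZE - 1


def bump(counter, taken):
    """Update a 2-bit saturating counter (0..3); values 2 and 3 mean 'taken'."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)


class TournamentPredictor:
    def __init__(self):
        self.local_hist = [0] * TABLE_SIZE   # per-branch outcome histories
        self.local_ctr = [1] * TABLE_SIZE    # counters indexed by local history
        self.global_hist = 0                 # global history register
        self.global_ctr = [1] * TABLE_SIZE   # counters indexed by global history
        self.selector = [2] * TABLE_SIZE     # >= 2 favors global, < 2 favors local

    def _index(self, pc):
        return (pc >> 2) & MASK

    def _components(self, pc):
        i = self._index(pc)
        local = self.local_ctr[self.local_hist[i]] >= 2
        glob = self.global_ctr[self.global_hist] >= 2
        return i, local, glob

    def predict(self, pc):
        i, local, glob = self._components(pc)
        return glob if self.selector[i] >= 2 else local

    def update(self, pc, taken):
        i, local, glob = self._components(pc)
        # Train the selector only when the component predictors disagree:
        # move toward global if global was right, toward local otherwise.
        if local != glob:
            self.selector[i] = bump(self.selector[i], glob == taken)
        lh = self.local_hist[i]
        self.local_ctr[lh] = bump(self.local_ctr[lh], taken)
        self.global_ctr[self.global_hist] = bump(self.global_ctr[self.global_hist], taken)
        # Shift the outcome into both histories.
        self.local_hist[i] = ((lh << 1) | int(taken)) & MASK
        self.global_hist = ((self.global_hist << 1) | int(taken)) & MASK
```

For example, a single branch with the repeating pattern taken, taken, taken, not taken is predicted correctly by both components once their histories have warmed up; the selector only matters when the local and global predictions differ.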
[Figure: Predictor Comparison]
Predicting Branch Targets
• To avoid a branch penalty in the 5-stage pipeline, we need to know the address from which to fetch the next instruction before the end of the IF stage
• Requires us to know whether the (as-yet undecoded) instruction is a branch and, if so, what the next PC should be
• Solution: Predict the target address for a potential branch during the IF stage
Branch Target Buffer (BTB)
• During the IF stage, use the PC of the current instruction (a possible branch) to index into a table of predicted target PCs for that branch
• Fetch of the predicted target begins at the start of the next cycle
Predicting Branch Targets
• Unlike a branch prediction buffer, the BTB cannot permit aliasing: the full PC must match the stored entry; otherwise we would fetch predicted targets for non-branch instructions, hurting performance
• If the branch is later resolved as not taken, remove the BTB entry
• Fetch for a predicted-not-taken branch is the same as for a non-branch: sequential
• If using a two-bit branch predictor within the BTB
– Can retain the BTB entry and use the prediction bits in the table instead
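A minimal sketch of the lookup and update rules above, assuming a direct-mapped, 64-entry BTB with 4-byte instructions (the size and organization are hypothetical). Each entry stores the full PC as its tag, so an aliased lookup misses and fetch falls through sequentially; not-taken branches are removed, as in the policy without prediction bits.

```python
# Minimal branch target buffer (BTB) sketch.
# Assumptions (hypothetical): direct-mapped, 64 entries, word-aligned PCs,
# 4-byte instructions; each entry holds the full PC as a tag plus a target.

BTB_ENTRIES = 64


class BranchTargetBuffer:
    def __init__(self):
        # Each slot is either None or a (tag_pc, predicted_target) pair.
        self.slots = [None] * BTB_ENTRIES

    def _slot(self, pc):
        return (pc >> 2) % BTB_ENTRIES

    def next_fetch_pc(self, pc):
        """IF-stage lookup: return the PC to fetch in the next cycle."""
        entry = self.slots[self._slot(pc)]
        # The full PC must match; accepting an aliased entry would redirect
        # fetch for a non-branch instruction.
        if entry is not None and entry[0] == pc:
            return entry[1]          # predicted taken: fetch the target
        return pc + 4                # miss: fetch sequentially

    def resolve(self, pc, taken, target):
        """Update once the branch outcome is known."""
        i = self._slot(pc)
        if taken:
            self.slots[i] = (pc, target)
        elif self.slots[i] is not None and self.slots[i][0] == pc:
            self.slots[i] = None     # resolved not taken: remove the entry
```

Here 0x100 and 0x200 map to the same slot, so after a taken branch at 0x100 is recorded, a lookup for 0x200 still falls through to 0x204. A BTB with two-bit prediction bits would instead keep the entry and consult the counter.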
Branch Target Buffer Behavior: Branch Penalty
Instruction in Buffer | Prediction | Actual Branch | Penalty Cycles
Yes                   | Taken      | Taken         | 0
Yes                   | Taken      | Not Taken     | 2
No                    | Not Taken  | Taken         | 2
No                    | Not Taken  | Not Taken     | 0
These penalties assume that the branch outcome is computed in the EX stage.
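The table can be turned into an expected penalty per branch by weighting the two 2-cycle rows by how often they occur. This is an illustrative sketch: the function name and the rates passed to it (90% of branches hit in the BTB, 90% of hits are actually taken, 60% of misses are taken) are assumed values, not measurements from the lecture.

```python
def average_branch_penalty(hit_rate, hit_accuracy, miss_taken_fraction, penalty=2):
    """Expected penalty cycles per branch for the BTB table above.

    hit_rate: fraction of branches found in the BTB (predicted taken)
    hit_accuracy: fraction of those hits that are actually taken
    miss_taken_fraction: fraction of BTB misses that turn out to be taken
    penalty: cycles lost per wrong guess (2 with EX-stage resolution)
    """
    mispredicted_hits = hit_rate * (1 - hit_accuracy)     # Yes / Taken / Not Taken row
    taken_misses = (1 - hit_rate) * miss_taken_fraction    # No / Not Taken / Taken row
    return penalty * (mispredicted_hits + taken_misses)


# Assumed, illustrative rates: 2 * (0.9 * 0.1 + 0.1 * 0.6) = 0.3 cycles per branch
print(round(average_branch_penalty(0.9, 0.9, 0.6), 2))  # 0.3
```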
Branch Folding
• Store the actual target instruction rather than its PC
– Saves an instruction fetch => the BTB access can take longer, so a larger branch target buffer can be built (the additional latency is compensated by the instruction fetch savings)
• Zero-cycle unconditional branches
– The branch target buffer signals a hit and provides the target instruction
– The target instruction is substituted for the current instruction (the unconditional branch)
Dealing with Indirect Branches
• Indirect branches have multiple potential targets, since the address comes from a register, which can hold many possible values
• Branch target buffers could be used for indirect branch target prediction
– However, many mispredictions can happen because the BTB can store only one target per branch
• Most indirect branches come from return instructions
Returning from Procedure Calls
• Procedures return to their callers via a jump instruction
• Procedures may be called from different places in a program
– This makes branch target prediction inaccurate if it relies on the previous return address
[Figure: three call sites (JALR R31) in the program call the same Procedure, which returns with JR R31]
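A short simulation of the problem above, assuming a hypothetical procedure called in turn from three call sites. A BTB-style predictor that remembers only the most recent target for the return instruction is wrong every time.

```python
# A single stored target mispredicts returns: the hypothetical procedure is
# called from three call sites, and a BTB entry remembers only the most
# recent return target of its JR R31.

call_sites = [0x100, 0x200, 0x300]                    # hypothetical call addresses
actual_returns = [pc + 4 for pc in call_sites] * 2    # two passes over the calls

last_target = None        # the one target the BTB entry holds
mispredictions = 0
for target in actual_returns:
    if last_target != target:
        mispredictions += 1
    last_target = target  # entry updated with the newest return target

print(mispredictions, "of", len(actual_returns))  # 6 of 6
```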
Return Address Stack (RAS)
Key Idea: Cache the most recent return addresses in a small buffer operating as a stack, called a return address stack (RAS)
• When a procedure call occurs, push the return address (the call instruction's address + 4) onto the RAS
• When a return instruction is encountered, pop the address from the RAS (last-in, first-out) and use it as the predicted target
[Figure: RAS entries stacked as Return Address 1, Return Address 2, ..., Return Address n]
If the RAS is sufficiently large (i.e., as large as the maximum call depth), it will predict the return addresses perfectly
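A minimal RAS sketch under stated assumptions: 4-byte instructions, a fixed depth, and (a choice made only for this sketch) the oldest entry is dropped on overflow while an empty stack gives no prediction. It shows the claim above: with enough entries for the call depth every return is predicted, and with too few the oldest returns are lost.

```python
from collections import deque


class ReturnAddressStack:
    def __init__(self, depth):
        # deque with maxlen silently discards the oldest entry on overflow
        self.entries = deque(maxlen=depth)

    def on_call(self, call_pc):
        """Push the return address (call address + 4)."""
        self.entries.append(call_pc + 4)

    def on_return(self):
        """Pop the predicted return target (LIFO); None if the stack is empty."""
        return self.entries.pop() if self.entries else None


def count_mispredictions(depth, call_depth):
    """Nested calls to call_depth levels, then unwind; count wrong predictions."""
    ras = ReturnAddressStack(depth)
    call_pcs = [0x1000 * (level + 1) for level in range(call_depth)]  # hypothetical
    for pc in call_pcs:
        ras.on_call(pc)
    wrong = 0
    for pc in reversed(call_pcs):
        if ras.on_return() != pc + 4:
            wrong += 1
    return wrong


print(count_mispredictions(depth=8, call_depth=8))  # 0: RAS covers the call depth
print(count_mispredictions(depth=4, call_depth=8))  # 4: the four oldest returns are lost
```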
Speculative Execution
• In high-performance pipelines with multiple issue, control dependences become the primary bottleneck
• Branch prediction allows the pipeline to partially continue until the branch outcome is known
• Instructions continue to be fetched/issued but cannot be executed until the branch outcome is known
Solution: Speculatively execute instructions based upon predicted branch outcome
Hardware-based Speculation
• Resulting Problem: If the branch prediction is wrong, we
– Must “undo” the effects of wrongly executed instructions
– Must deal with potential exceptions raised by wrongly executed instructions
• Solution: Separate “execution” and “write results” from “commit”
Hardware-based Speculation
• Combines three key ideas:
• Dynamic branch prediction
• Speculation
– Allow execution of instructions before control dependences are resolved
– Ability to undo the effects of incorrectly executed instructions
• Dynamic scheduling
• Used in most modern high performance processors
• Relies on an extension to Tomasulo’s algorithm
• Separates the bypassing of results between instructions from actual instruction completion (“commit”)
• In-order commit is implemented with a reorder buffer (ROB)
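The separation of execution from commit can be sketched with a toy reorder buffer. This is a hypothetical, heavily simplified model (no reservation stations, register renaming, or stores): instructions may finish out of order and park their results in the ROB, but results reach the architectural registers only when an entry commits from the head, and a mispredicted branch at the head squashes everything younger.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    tag: int
    dest: Optional[str]            # architectural register; None for branches
    value: Optional[int] = None
    ready: bool = False            # execution finished, result held in the ROB
    mispredicted: bool = False     # only meaningful for branches


class ReorderBuffer:
    """Hypothetical minimal ROB: out-of-order completion, in-order commit."""

    def __init__(self):
        self.entries = deque()
        self.regs = {}             # architectural register file
        self.next_tag = 0

    def issue(self, dest=None):
        entry = RobEntry(self.next_tag, dest)
        self.next_tag += 1
        self.entries.append(entry)
        return entry

    def complete(self, entry, value=None, mispredicted=False):
        """Execution finished: the result waits in the ROB, not yet architectural."""
        entry.value, entry.ready, entry.mispredicted = value, True, mispredicted

    def commit(self):
        """Retire ready instructions from the head, in program order."""
        while self.entries and self.entries[0].ready:
            head = self.entries.popleft()
            if head.dest is not None:
                self.regs[head.dest] = head.value
            if head.mispredicted:
                self.entries.clear()   # squash all younger (speculative) work
                break


rob = ReorderBuffer()
i1 = rob.issue("R1")
br = rob.issue()              # branch, predicted taken
i3 = rob.issue("R2")          # speculative instruction on the predicted path
rob.complete(i3, 7)           # finishes first (out of order)
rob.commit()                  # nothing commits: the head (i1) is not ready
rob.complete(i1, 5)
rob.complete(br, mispredicted=True)
rob.commit()
print(rob.regs)               # {'R1': 5}: R2 was never written
```

Exceptions can be handled the same way: recorded in the entry and acted on only when that entry reaches the head, so an exception from a wrongly executed instruction is discarded with the squash.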