ECE 486/586 Computer Architecture Lecture # 17 (web.cecs.pdx.edu/~zeshan/ece586_lec17.pdf)
![Page 1: ECE 486/586 Computer Architecture Lecture # 17web.cecs.pdx.edu/~zeshan/ece586_lec17.pdf · 2019-05-30 · Computer Architecture Lecture # 17 Spring 2019 Portland State University.](https://reader033.fdocuments.in/reader033/viewer/2022042709/5f4f55022afa395c6303498b/html5/thumbnails/1.jpg)
ECE 486/586
Computer Architecture
Lecture # 17
Spring 2019
Portland State University
Lecture Topics
• Branch Prediction
– Tournament Predictors
– Branch Target Buffer (BTB)
– Return Address Stack (RAS)
• Speculative Execution
Reference:
• Chapter 3: Sections 3.3, 3.6, 3.9
Tournament Predictors: Adaptively Combining Local and Global Predictors
• Some branches are predicted more accurately with global predictors
• Some branches are predicted better with local predictors
• Key Idea: Combine both local and global predictors and dynamically select the right predictor for the right branch
• The selector is itself a 2-bit predictor (a saturating counter with a small state machine)
– It picks whichever predictor (local, global, or even some mix) was most effective in recent predictions
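The selection idea above can be sketched in a few lines of Python. This is a minimal illustration, not the organization of any real processor: the class names, table sizes, and the `run` helper are assumptions, and the "local" side is simplified to a per-PC 2-bit counter rather than a true two-level local-history predictor.

```python
class TwoBitCounter:
    """2-bit saturating counter: states 0..3; states 2 and 3 predict 'taken'."""

    def __init__(self, state=1):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)


class TournamentPredictor:
    """Hypothetical tournament predictor (sizes are illustrative assumptions)."""

    def __init__(self, entries=1024, history_bits=10):
        self.entries = entries
        self.hist_mask = (1 << history_bits) - 1
        self.global_history = 0
        # Simplified "local" predictor: one counter per PC index.
        self.local = [TwoBitCounter() for _ in range(entries)]
        # Global predictor: one counter per global-history pattern.
        self.global_ = [TwoBitCounter() for _ in range(1 << history_bits)]
        # Selector: state >= 2 means "trust the global predictor".
        self.selector = [TwoBitCounter() for _ in range(entries)]

    def _index(self, pc):
        return (pc >> 2) % self.entries

    def predict(self, pc):
        i = self._index(pc)
        if self.selector[i].predict():
            return self.global_[self.global_history].predict()
        return self.local[i].predict()

    def update(self, pc, taken):
        i = self._index(pc)
        local_ok = self.local[i].predict() == taken
        global_ok = self.global_[self.global_history].predict() == taken
        # Train the selector only when the two predictors disagree.
        if local_ok != global_ok:
            self.selector[i].update(global_ok)
        self.local[i].update(taken)
        self.global_[self.global_history].update(taken)
        self.global_history = ((self.global_history << 1) | int(taken)) & self.hist_mask


def run(predictor, pc, outcomes):
    """Feed a sequence of outcomes for one branch; return the number predicted correctly."""
    correct = 0
    for taken in outcomes:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct
```

On a strictly alternating branch (taken, not taken, taken, ...), the per-PC counter keeps mispredicting while the global-history counters learn the pattern, so the selector drifts toward the global predictor.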
Predictor Comparison
Predicting Branch Targets
• To avoid the branch penalty in a 5-stage pipeline, we must know before the end of the IF stage which address to fetch the next instruction from
• This requires knowing whether the (as-yet undecoded) instruction is a branch and, if so, what the next PC should be
• Solution: Predict the target address for a potential branch during the IF stage
Branch Target Buffer (BTB)
• During the IF stage, use the PC of the current instruction (a possible branch) to index into a table of predicted target PCs
• On a hit, fetch of the predicted target begins at the start of the next cycle
Predicting Branch Targets
• Unlike the branch direction predictor, the BTB cannot tolerate aliasing: the stored PC must match the fetch PC exactly. Otherwise we would fetch predicted targets for non-branch instructions, hurting performance
• If a branch is later resolved as not taken, remove its BTB entry
– Fetch for a predicted-not-taken branch is then the same as for a non-branch: sequential
• If the BTB also holds a two-bit branch predictor, the entry can be retained and its prediction bits used instead
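A minimal Python sketch of the lookup and update rules above, assuming a direct-mapped table, 4-byte instructions, and the simple policy of removing the entry on a not-taken resolution. The class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Hypothetical direct-mapped BTB; each slot holds (branch_pc, predicted_target)."""

    def __init__(self, entries=256):
        self.entries = entries
        self.table = [None] * entries

    def _index(self, pc):
        return (pc >> 2) % self.entries

    def lookup(self, pc):
        """Called during IF: return the PC to fetch next."""
        slot = self.table[self._index(pc)]
        # The full PC must match; an index match alone is not enough.
        if slot is not None and slot[0] == pc:
            return slot[1]          # hit: predicted taken, fetch the target
        return pc + 4               # miss: treat like a non-branch, fetch sequentially

    def resolve(self, pc, taken, target):
        """Called when the branch outcome is known."""
        i = self._index(pc)
        if taken:
            self.table[i] = (pc, target)        # insert or refresh the entry
        elif self.table[i] is not None and self.table[i][0] == pc:
            self.table[i] = None                # resolved not taken: remove the entry


btb = BranchTargetBuffer(entries=4)
btb.resolve(0x100, taken=True, target=0x200)
next_pc_branch = btb.lookup(0x100)   # hits and returns the stored target
next_pc_alias = btb.lookup(0x110)    # same index, different PC: falls through to PC + 4
```

The tag comparison in `lookup` is what keeps a non-branch instruction that shares a table index from being redirected to someone else's target.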
Branch Target Buffer Behavior
Branch Penalty
| Instruction in Buffer | Prediction | Actual Branch | Penalty Cycles |
|---|---|---|---|
| Yes | Taken | Taken | 0 |
| Yes | Taken | Not Taken | 2 |
| No | Not Taken | Taken | 2 |
| No | Not Taken | Not Taken | 0 |
The above penalty assumes that the branch outcome is being computed in EX stage
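As a worked illustration of the penalty table, the average penalty per branch can be estimated by weighting the two penalized cases by how often they occur. The function name and the example rates (90% BTB hit rate, 90% accuracy on hits, 60% of missing branches taken) are assumptions chosen for illustration, not figures from the lecture.

```python
def expected_branch_penalty(hit_rate, accuracy_on_hit, taken_on_miss, penalty=2):
    """Average penalty cycles per branch, using the two penalized rows of the table.

    hit_rate        -- fraction of branches found in the BTB (assumed input)
    accuracy_on_hit -- fraction of BTB hits whose 'taken' prediction is correct
    taken_on_miss   -- fraction of branches missing from the BTB that are taken
    penalty         -- cycles lost per misprediction (2 when resolved in EX)
    """
    in_buffer_not_taken = hit_rate * (1.0 - accuracy_on_hit)
    not_in_buffer_taken = (1.0 - hit_rate) * taken_on_miss
    return penalty * (in_buffer_not_taken + not_in_buffer_taken)


# Illustrative numbers only: roughly 0.30 penalty cycles per branch.
example_penalty = expected_branch_penalty(0.9, 0.9, 0.6)
```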
Branch Folding
• Store the actual target instruction rather than its PC
– Saves a memory fetch cycle => can be leveraged to build a larger branch target buffer (additional latency compensated by instruction fetch savings)
• Zero-cycle unconditional branches
– Branch target buffer signals a hit and provides the target instruction
– Target instruction is substituted for the current instruction (the unconditional branch)
Dealing with Indirect Branches
• Indirect branches have multiple potential targets, since the target address comes from a register, which can hold many possible values
• Branch target buffers could be used for indirect branch target prediction
– However, many mispredictions can happen because the BTB can store only one target per branch
• Most indirect branches come from procedure returns
Returning from Procedure Calls
• Procedures return to their callers via an indirect jump instruction (JR R31 in the example)
• Procedures may be called from different places in a program
– This makes branch target prediction inaccurate if it relies on the previous return address
[Figure: three call sites (each JALR R31) invoke the same procedure, which returns with JR R31]
Return Address Stack (RAS)
Key Idea: Cache the most recent return addresses in a small buffer operating as a stack, called the return address stack (RAS)
• When a “procedure call” occurs, push the return address (which is the call address + 4) onto the RAS
• When a return instruction is encountered, pop the address from the RAS (last-in, first-out) and use it as the predicted target
[Figure: the RAS as a stack of entries, Return Address 1 through Return Address n]
If the RAS is sufficiently large (i.e., as large as the maximum call depth), it will predict the return addresses perfectly
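The push-on-call, pop-on-return behavior can be sketched as follows. This is a minimal model, assuming 4-byte instructions and an overflow policy of discarding the oldest entry; the class name, method names, and the addresses in the example are hypothetical.

```python
class ReturnAddressStack:
    """Hypothetical fixed-depth RAS used to predict return targets."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, call_pc):
        """Procedure call seen: push the return address (call address + 4)."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)       # assumed overflow policy: drop the oldest entry
        self.stack.append(call_pc + 4)

    def on_return(self):
        """Return seen: pop the predicted target, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack()
predicted = []
# The same procedure is called from three different places.
for call_site in (0x100, 0x200, 0x300):
    ras.on_call(call_site)
    predicted.append(ras.on_return())
```

Each return is predicted to the instruction after its own call site, which a BTB holding a single target for the return instruction could not do. When the call depth exceeds `depth`, the oldest return addresses are lost and those returns are mispredicted.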
Speculative Execution
• In high-performance pipelines with multiple issue, control dependences become the primary bottleneck
• Branch prediction alone lets the pipeline continue only partially until the branch outcome is known
– Instructions continue to be fetched and issued, but cannot execute until the branch outcome is known
Solution: Speculatively execute instructions based on the predicted branch outcome
Hardware-based Speculation
• Resulting problem: if the branch prediction is wrong, we
– must “undo” the effects of wrongly executed instructions
– must deal with potential exceptions arising from wrongly executed instructions
• Solution: Separate “execution” and “write results” from “commit”
Hardware-based Speculation
• Combines three key ideas:
– Dynamic branch prediction
– Speculation: allow execution of instructions before control dependences are resolved, with the ability to undo the effects of incorrectly executed instructions
– Dynamic scheduling
• Used in most modern high-performance processors
• Relies on an extension to Tomasulo’s algorithm
– Separates the bypassing of results among instructions from actual instruction completion (“commit”)
– Commit order is enforced with a reorder buffer (ROB)
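The separation of result writing from commit can be illustrated with a toy reorder buffer. This is a sketch under simplifying assumptions (no reservation stations, no exceptions, unbounded size); the class, its fields, and the register names are hypothetical and do not model Tomasulo's algorithm in full.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: results may arrive out of order, but commit happens in program order."""

    def __init__(self):
        self.entries = deque()   # program order; the head is the oldest instruction
        self.regs = {}           # architectural registers, updated only at commit
        self.next_id = 0

    def issue(self, dest, is_branch=False):
        entry = {"id": self.next_id, "dest": dest, "value": None,
                 "done": False, "is_branch": is_branch, "mispredicted": False}
        self.next_id += 1
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value=None, mispredicted=False):
        """Execution finished: record the result in the ROB, not in the registers."""
        entry["value"] = value
        entry["mispredicted"] = mispredicted
        entry["done"] = True

    def commit(self):
        """Retire completed instructions from the head; return the ids committed."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            committed.append(entry["id"])
            if entry["is_branch"] and entry["mispredicted"]:
                self.entries.clear()     # squash all younger, speculative instructions
                break
            if entry["dest"] is not None:
                self.regs[entry["dest"]] = entry["value"]
        return committed


rob = ReorderBuffer()
i0 = rob.issue("R1")
br = rob.issue(None, is_branch=True)
i2 = rob.issue("R2")                 # speculative: fetched past the branch
rob.write_result(i2, 99)             # finishes first, but cannot commit yet
first = rob.commit()                 # nothing retires: the head is not done
rob.write_result(i0, 5)
rob.write_result(br, mispredicted=True)
second = rob.commit()                # retires i0 and the branch, then squashes i2
```

Because `R2` is written only at commit, the speculative result computed past the mispredicted branch never reaches architectural state, so nothing has to be undone in the register file.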