ISLPED’03 1
Reducing Reorder Buffer Complexity Through Selective Operand Caching
*supported in part by DARPA through the PAC-C program and NSF
Gurhan Kucuk, Dmitry Ponomarev, Oguz Ergin, Kanad Ghose
Department of Computer Science
State University of New York
Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/~lowpower
International Symposium on Low Power Electronics and Design (ISLPED’03), August 26th, 2003
Outline
Reorder Buffer (ROB) complexities
Motivation for the low-complexity ROB
Low-complexity ROB (ICS’02)
Improving the design using short-lived values
Results
Concluding remarks
P6 Style Superscalar Datapath
[Figure: block diagram of the datapath. Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch into the IQ, instruction issue to the function units FU1, FU2, ..., FUm (EX), LSQ, D-cache, ROB, Architectural Register File (ARF), and the result/status forwarding buses.]
ROB Port Requirements for a W-way CPU
Decode/Dispatch: W write ports to set up entries
Dispatch/Issue: 2W read ports to read the source operands
Writeback: W write ports to write results
Commit: W read ports for instruction commitment
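The port counts above add up quickly as the machine width grows. The following is a minimal sketch of that arithmetic; the function name and dictionary keys are illustrative and not from the talk, and the 2W figure is taken to mean up to two source operands per instruction.

```python
def rob_port_counts(width):
    """Per-stage ROB port counts for a W-way machine, as listed on the slide."""
    return {
        "decode_dispatch_write": width,    # set up new ROB entries
        "dispatch_issue_read": 2 * width,  # read source operands (assumed: up to two per instruction)
        "writeback_write": width,          # write results
        "commit_read": width,              # read results for commitment
    }


ports = rob_port_counts(4)
total_ports = sum(ports.values())
source_read_share = ports["dispatch_issue_read"] / total_ports
print(ports, total_ports, source_read_share)
```

For W = 4 this gives 20 ports in total, 8 of which (40%) exist only to read source operands.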
Where are the Source Values Coming From?
[Figure: the datapath diagram with three sources of operand values marked 1, 2, and 3.]
Where are the Source Values Coming From?
[Figure: bar chart, 0% to 100%. 96-entry ROB, 4-way processor, SPEC2K benchmarks.]
Forwarding: 62%
ARF: 32%
ROB: 6%
How Efficiently are the Ports Used?
Decode/Dispatch: W write ports to set up entries
Dispatch/Issue: 2W read ports to read the source operands (only 6% of the source operands come from the ROB)
Writeback: W write ports to write results
Commit: W read ports for instruction commitment
Our Solution: Elimination of Read Ports
[Figure, built over three slides: the datapath diagram with the operand sources marked 1, 2, and 3; in the last step only 1 and 3 remain.]
Comparison of ROB Bitcells (0.18 µm, TSMC)
[Figure: layout of a 32-ported SRAM bitcell next to the layout of a 16-ported SRAM bitcell.]
Area reduction: 71%
Shorter bitlines and wordlines
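As a rough sanity check on the area figure, a common first-order rule of thumb says that each port adds a wordline and a bitline to the cell, so bitcell area grows roughly with the square of the port count. That rule is an assumption brought in here, not something stated in the talk, and the function below is purely illustrative.

```python
def relative_bitcell_area(ports, reference_ports):
    """First-order estimate: bitcell area scales with the square of the port count."""
    return (ports / reference_ports) ** 2


estimated_reduction = 1.0 - relative_bitcell_area(16, 32)
print(f"{estimated_reduction:.0%}")  # 75%
```

The estimate lands in the same range as the 71% on the slide, which comes from actual layouts rather than from a scaling rule.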
Completely Eliminating the Source Read Ports on the ROB
The Problem: Issue of instructions that require a value stored in the ROB will stall
Solutions:
Forward the value to the waiting instruction at the time of committing the value: LATE FORWARDING
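A minimal behavioral sketch of late forwarding follows, assuming a simplified issue queue in which each entry tracks the ROB tags it is still waiting on. The names `IQEntry`, `broadcast`, and `commit` are hypothetical and only illustrate the idea; they do not describe the authors' hardware.

```python
from dataclasses import dataclass, field


@dataclass
class IQEntry:
    name: str
    waiting_on: set = field(default_factory=set)  # ROB tags of operands not yet captured
    operands: dict = field(default_factory=dict)  # tag -> captured value

    def ready(self):
        return not self.waiting_on


def broadcast(issue_queue, tag, value):
    """Drive (tag, value) on a forwarding bus; entries waiting on the tag capture the value."""
    for entry in issue_queue:
        if tag in entry.waiting_on:
            entry.waiting_on.discard(tag)
            entry.operands[tag] = value


def commit(rob_entry, issue_queue, arf):
    """Commit one ROB entry: update the ARF and forward the value a second time."""
    tag, arch_reg, value = rob_entry
    arf[arch_reg] = value
    broadcast(issue_queue, tag, value)  # late forwarding reuses the normal forwarding bus


# SUB was dispatched after its producer (tag 31) had already written the ROB,
# so it missed the first broadcast and cannot read the ROB directly.
iq = [IQEntry("SUB", waiting_on={31})]
arf = {}
commit((31, 1, 42), iq, arf)
print(iq[0].ready(), iq[0].operands, arf)
```

The waiting instruction only becomes ready at commit time, which is the source of the performance loss the later slides address.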
Late Forwarding: Use the Normal Forwarding Buses!
[Figure, shown over two slides: the datapath diagram with the result/status forwarding buses labeled.]
Improving Performance
Cache recently generated values in a set of RETENTION LATCHES (RL)
Retention Latches are SMALL and FAST
Only 8 to 16 latches needed in the set
Entire set has 1 or 2 read ports
Datapath with the Retention Latches
[Figure, built over two slides: the datapath diagram with a RETENTION LATCHES block added.]
Retention Latch Management Strategies
FIFO
8-entry RL: 42% hit rate
16-entry RL: 55% hit rate
LRU
8-entry RL: 56% hit rate
16-entry RL: 62% hit rate
Random Replacement
Worse performance than FIFO
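The difference between the FIFO and LRU strategies can be sketched with a small software model. The class below is a hypothetical illustration, assuming the latches are looked up by a result tag; it says nothing about how the hardware is actually built.

```python
from collections import OrderedDict


class RetentionLatches:
    """Small set of latches holding recent results, with FIFO or LRU replacement."""

    def __init__(self, size, policy="FIFO"):
        if policy not in ("FIFO", "LRU"):
            raise ValueError(f"unknown policy: {policy}")
        self.size = size
        self.policy = policy
        self.entries = OrderedDict()  # tag -> value, next victim first
        self.lookups = 0
        self.hits = 0

    def write(self, tag, value):
        self.entries.pop(tag, None)           # a reused tag replaces its old value
        if len(self.entries) >= self.size:
            self.entries.popitem(last=False)  # evict the oldest (FIFO) or least recently used (LRU)
        self.entries[tag] = value

    def read(self, tag):
        self.lookups += 1
        if tag not in self.entries:
            return None                       # miss: the consumer waits for late forwarding
        self.hits += 1
        if self.policy == "LRU":
            self.entries.move_to_end(tag)     # a read refreshes the entry under LRU only
        return self.entries[tag]


def demo(policy):
    rl = RetentionLatches(2, policy)
    rl.write("a", 1)
    rl.write("b", 2)
    rl.read("a")       # under LRU this protects "a" from the next eviction
    rl.write("c", 3)
    return rl.read("a")


print(demo("FIFO"), demo("LRU"))  # None 1
```

FIFO needs no bookkeeping on reads, which keeps the latch set simple; LRU keeps recently read values longer at the cost of tracking use order.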
Advantages of Using Retention Latches
Reduces energy dissipation in the ROB – avoids creating a localized hot spot
Reduces associated performance losses
Reduces ROB complexity – smaller floor plan, easier validation
Improving Retention Latch Management
PROBLEM: All generated results, irrespective of whether they could be potentially read from the RLs, are written into the latches unconditionally
CONSEQUENCE: The array of RLs is not utilized efficiently and performance loss is still noticeable
SOLUTION: We identify the values that will never be read after the cycle in which they are generated and avoid writing these values into the RLs
Short-Lived Values
Our definition: a value is short-lived if the destination register is renamed by the time of the result generation
Identified one cycle before the result writeback
Original code:
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4
After the RENAMER:
LOAD P31, R2, 100
SUB P32, P31, R3
ADD P33, P32, R4
AVOID WRITING SHORT-LIVED VALUES INTO THE RETENTION LATCHES
Reasons:
Short-lived values are forwarded directly to all potential consumers in the issue queue
No instruction will ever consume a short-lived value from the retention latches
Results:
Increased RL hit ratios and better overall performance
Key Idea: Do not cache short-lived values
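The key idea amounts to a filter on the write path into the retention latches. Here is a minimal sketch, assuming a per-result bit that says whether the destination has already been renamed; `writeback`, `renamed`, `rob`, and `retention_latches` are illustrative names, and plain dictionaries stand in for the hardware structures.

```python
def writeback(tag, value, renamed, rob, retention_latches):
    """Write a result back; cache it in the retention latches only if it may be read later.

    Returns True if the value was cached, False if it was filtered out as short-lived.
    """
    rob[tag] = value                # the ROB always receives the result
    if renamed[tag]:
        return False                # short-lived: all consumers get it by direct forwarding
    retention_latches[tag] = value  # replacement policy omitted in this sketch
    return True


renamed = {31: True, 32: False}
rob, latches = {}, {}
cached_31 = writeback(31, 10, renamed, rob, latches)
cached_32 = writeback(32, 20, renamed, rob, latches)
print(cached_31, cached_32, latches)
```

Only the second result takes up a latch, so the same number of latches covers more of the values that can still be requested.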
The Good News: 80%+ of the Values are Short-Lived
[Figure: chart of the percentage (0 to 100) of result values that are short-lived; 96-entry ROB, 4-way processor.]
Identifying Short-Lived Values
Maintain the bit-vector Renamed
Set by the Renamer at the time of renaming
Original code:
LOAD R1, R2, 100
SUB R5, R1, R3
ADD R1, R5, R4
Renamed code:
LOAD P31, R2, 100
SUB P32, P31, R3
ADD P33, P32, R4
Rename table after LOAD and SUB are renamed:
Arch. Reg.   Phys. Reg.   Location (0 = ROB, 1 = ARF)
0            0            1
1            31           0
2            2            1
3            3            1
4            4            1
5            32           0
Rename table after ADD is renamed (R1 now maps to P33):
Arch. Reg.   Phys. Reg.   Location (0 = ROB, 1 = ARF)
0            0            1
1            33           0
2            2            1
3            3            1
4            4            1
5            32           0
Renamed[31] = 1
Renamed bit is checked one cycle before writeback
Value produced by LOAD is short-lived because Renamed[31] = 1
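The bookkeeping on these slides can be reproduced with a small model of the renamer. This is a sketch under simplifying assumptions: physical registers are handed out sequentially starting at 31 to match the example, and commit handling (returning a register's location to the ARF and recycling Renamed bits) is left out. The class and method names are hypothetical.

```python
class Renamer:
    """Rename table plus the Renamed bit-vector, following the example on the slides."""

    def __init__(self, num_arch_regs, first_free_phys):
        self.phys = {r: r for r in range(num_arch_regs)}       # arch reg -> phys reg
        self.in_arf = {r: True for r in range(num_arch_regs)}  # location bit (True = ARF)
        self.next_phys = first_free_phys
        self.renamed = {}                                      # phys reg -> Renamed bit

    def rename(self, dest, sources):
        # Sources are looked up before the destination mapping changes.
        srcs = [f"R{s}" if self.in_arf[s] else f"P{self.phys[s]}" for s in sources]
        if not self.in_arf[dest]:
            # The previous in-flight mapping of dest is superseded: mark it as renamed.
            self.renamed[self.phys[dest]] = 1
        new = self.next_phys
        self.next_phys += 1
        self.phys[dest], self.in_arf[dest] = new, False
        self.renamed[new] = 0
        return f"P{new}", srcs


r = Renamer(num_arch_regs=6, first_free_phys=31)
print(r.rename(1, [2]))     # LOAD R1, R2, 100
print(r.rename(5, [1, 3]))  # SUB  R5, R1, R3
print(r.rename(1, [5, 4]))  # ADD  R1, R5, R4
print(r.renamed)            # {31: 1, 32: 0, 33: 0}
```

When the LOAD is about to write back, Renamed[31] is already 1, so its result is classified as short-lived and kept out of the retention latches.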
Hit Ratios to Retention Latches
[Figure: per-benchmark hit ratios (0 to 100) for 8 original FIFO RLs and 8 optimized FIFO RLs. Integer benchmarks: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating-point benchmarks: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.]
Average Hit Ratio: 46% (8 original FIFO RLs), 73% (8 optimized FIFO RLs)
Experimental Results: Effect on Performance
[Figure: per-benchmark IPC (0 to 3) for Baseline, 8 original RLs, 8 optimized RLs, 4 optimized RLs, and 2 optimized RLs. Integer benchmarks: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating-point benchmarks: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.]
Avg. IPC Drop: 1.7%, 0.5%, 1.1%
Experimental Results: Effect on ROB Power
[Figure: per-benchmark energy in pJ for Baseline, 8 optimized RLs, 4 optimized RLs, and 2 optimized RLs. Integer benchmarks: bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg. Floating-point benchmarks: applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.]
Avg. Savings: 15.9%, 13.7%, 15.0%
Conclusions
We proposed a mechanism to further improve the performance and reduce the complexity of a processor that uses retention latches and eliminates the ROB source read ports
The idea is to avoid caching the short-lived result values in the retention latches
Both retention latch hit ratio and the overall performance improved
Alternatively, fewer retention latches can be used with the same performance
THANK YOU !
*supported in part by DARPA through the PAC-C program and NSF
LOW POWER RESEARCH GROUP
Department of Computer Science
State University of New York
Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/~lowpower
International Symposium on Low Power Electronics and Design (ISLPED’03), August 27th 2003